Daniel L. Hartl

Harvard University 

Andrew G. Clark 

Pennsylvania State University 

Principles of 
Population Genetics 


Sinauer Associates, Inc. Publishers
Sunderland, Massachusetts 


This image represents data from the first study of nucleotide sequence variation in a
natural population, conducted by Martin Kreitman (1983, Nature 304, pp. 412-417). Each
of the 11 vertical bands represents the 43 varying nucleotide sites in one allele of the
gene alcohol dehydrogenase taken from a global distribution of the fruit fly Drosophila
melanogaster. The colors correspond to the bases at each site: adenine = green,
cytosine = yellow, guanine = blue, and thymine = red. Note that at each position only two
different bases were observed. Blocks of sites in strong linkage disequilibrium can
also be seen as repeated patterns of color. The sequences are oriented with the 5' end
at the top, and the nucleotide corresponding to the fast/slow difference in the ADH
protein is the twelfth one from the bottom.

To Barbara and Christine 

Copyright © 1997 by Sinauer Associates, Inc. All rights reserved.
This book may not be reproduced in whole or in part without permission
from the publisher.

For information or to order, address 

Sinauer Associates, Inc., PO Box 407

23 Plumtree Road, Sunderland, MA, 01375 U.S.A.

FAX 413-549-1118

Internet: publish@sinauer.com; http://

Library of Congress Cataloging-in-Publication Data 

Hartl, Daniel L.

Principles of population genetics / Daniel L. Hartl, Andrew G.
Clark. -- 3rd ed.

p. cm.

Includes bibliographical references and index.

ISBN 0-87893-306-9 (hardcover)

1. Population genetics. 2. Quantitative genetics. 3. Population
genetics -- Problems, exercises, etc. I. Clark, Andrew G., 1954- .
II. Title.

QH455.H36 1997
576.5'8 -- dc21


Printed in Canada 

5 4 3 2 1 

Table of Contents 



Gene Expression and Gene Interaction
Gene Expression
The Genetic Code
Alleles
Genotype and Phenotype
Dominance and Gene Interaction
Segregation and Recombination
Probability in Population Genetics
The Addition Rule
The Multiplication Rule
Repeated Trials
Phenotypic Diversity and Genetic Variation
Allele Frequencies in Populations
Parameters and Estimates
The Standard Error of an Estimate
Models in Population Genetics
Exponential Population Growth
Logistic Population Growth
Summary
Problems



Phenotypic Variation in Natural Populations
Continuous Variation: The Normal Distribution
Mean and Variance
Central Limit Theorem
Discrete Mendelian Variation
Experimental Methods for Detecting Genetic Variation
Protein Electrophoresis
The Southern Blot Procedure
The Polymerase Chain Reaction
Polymorphism and Heterozygosity
Allozyme Polymorphisms
How Representative Are Allozymes?
Polymorphisms in DNA Sequences
Nucleotide Polymorphism and Nucleotide Diversity
Uses of Genetic Polymorphisms
Multiple-Factor Inheritance
Summary
Problems


Random Mating
Nonoverlapping Generations
The Hardy-Weinberg Principle
Random Mating of Genotypes versus Random Union of Gametes
Implications of the Hardy-Weinberg Principle
The Hardy-Weinberg Principle in Operation
Complications of Dominance
Frequency of Heterozygotes
Special Cases of Random Mating
Three or More Alleles
X-Linked Genes
Linkage and Linkage Disequilibrium
Summary
Problems





Hierarchical Population Structure
Reduction in Heterozygosity
Average Heterozygosity
Wright's F Statistics
Genetic Divergence among Subpopulations
Isolate Breaking: The Wahlund Principle
Wahlund's Principle and the Fixation Index
Genotype Frequencies in Subdivided Populations
Population Genetics in DNA Typing
Polymorphisms Based on a Variable Number of Tandem Repeats (VNTR)
Match Probabilities with Hardy-Weinberg Equilibrium and Linkage Equilibrium
Effects of Population Substructure
Inbreeding
Genotype Frequencies with Inbreeding
Relation between the Inbreeding Coefficient and the F Statistics
The Inbreeding Coefficient as a Probability
Genetic Effects of Inbreeding
Calculation of the Inbreeding Coefficient from Pedigrees
Regular Systems of Mating
Assortative Mating
Summary
Problems

Mutation
Irreversible Mutation
Reversible Mutation
Probability of Fixation of a New Neutral Mutation
The Infinite-Alleles Model
Neutral Mutations
Linkage and Recombination
Presumed Evolutionary Benefit of Recombination
Recombination and Polymorphism
Piecewise Recombination in Bacteria
Absence of Recombination in Animal Mitochondrial DNA
Migration
One-Way Migration
The Island Model of Migration
How Migration Limits Genetic Divergence
Estimates of Migration Rates
Patterns of Migration
Transposable Elements
Factors Controlling the Population Dynamics of Transposable Elements
Insertion Sequences and Composite Transposons in Bacteria
Transposable Elements in Eukaryotes
Horizontal Transmission of Transposable Elements
Summary
Problems


Selection in Haploid Organisms
Discrete Generations
Continuous Time
Change in Allele Frequency in Haploids
Darwinian Fitness and Malthusian Fitness
Selection in Diploid Organisms
Change in Allele Frequency in Diploids
Time Required for a Given Change in Allele Frequency
Application to the Evolution of Insecticide Resistance
Equilibria with Selection
Overdominance
Local Stability
Heterozygote Inferiority
The Adaptive Topography and the Role of Random Genetic Drift
Mutation-Selection Balance
Equilibrium Allele Frequencies
The Haldane-Muller Principle
More Complex Types of Selection
Frequency-Dependent Selection
Density-Dependent Selection
Fecundity Selection
Age-Structured Populations
Heterogeneous Environments and Clines
Diversifying Selection
Differential Selection in the Sexes
X-Linked Genes
Gametic Selection
Meiotic Drive
Multiple Alleles
Multiple Loci and Gene Interaction: Epistasis
Sexual Selection
Kin Selection
Interdeme Selection and the Shifting Balance Theory
Summary
Problems


Random Genetic Drift and Binomial Sampling
The Wright-Fisher Model of Random Genetic Drift
The Diffusion Approximation
Absorption Time and Time to Fixation
Parallelism between Random Drift and Inbreeding
Effective Population Size
Fluctuation in Population Size
Unequal Sex Ratio, Sex Chromosomes, Organelle Genes
Balance between Mutation and Drift
Infinite Alleles Model
The Ewens Sampling Formula
The Ewens-Watterson Test
Infinite-Sites Model
Gene Trees and the Coalescent
Coalescent Models with Mutation
Summary
Problems


The Neutral Theory and Molecular Evolution
Theoretical Principles of the Neutral Theory
Estimating Rates of Molecular Sequence Divergence
Rates of Amino Acid Replacement
Rates of Nucleotide Substitution
Other Measures of Molecular Divergence
The Molecular Clock
Variation across Genes in the Rate of the Molecular Clock
Variation across Lineages in Clock Rate
The Generation-Time Effect
Does the Constancy of Substitution Rates Prove the Neutral Theory?
Patterns of Nucleotide and Amino Acid Substitution
Calculating Synonymous and Nonsynonymous Substitution Rates
Within-Species Polymorphism
Implications of Codon Bias
Polymorphism and Divergence in Nucleotide Sequence Data
Impact of Local Recombination Rates
Gene Genealogies
Hypothesis Testing Using Trees
Inferences about Migration Based on Gene Trees
Mitochondrial and Chloroplast DNA Evolution
Chloroplast DNA and Organelle Transmission in Plants
Maintenance of Variation in Organelle Genomes
Evidence for Selection in mtDNA
Molecular Phylogenetics
Algorithms for Phylogenetic Tree Reconstruction
Distance Methods versus Parsimony
Bootstrapping and Statistical Confidence in a Tree
Shared Polymorphism
Interspecific Genetics
Multigene Families
Causes of Concerted Evolution
Multigene Family Evolution through a Birth and Death Process
Structural RNA Genes and Compensatory Substitutions
Multigene Superfamilies
Dispersed Highly Repetitive DNA Sequences
Summary
Problems

Types of Quantitative Traits
Resemblance between Relatives and the Concept of Heritability
Artificial Selection and Realized Heritability
Prediction Equation for Individual Selection
Selection Limits
Genetic Models for Quantitative Traits
Change in Gene Frequency
Genetic Model for the Change in Mean Phenotype
Components of Phenotypic Variance
Genetic and Environmental Sources of Variation
Components of Genotypic Variation
Covariance between Relatives
Twin Studies and Inferences of Heritability in Humans
Experimental Assessment of Genetic Variance Components
Indirect Estimation of the Number of Genes Affecting a Quantitative Character
Norm of Reaction and Phenotypic Plasticity
Threshold Traits and the Genetics of Liability
Correlated Response and Genetic Correlation
Inference of Selection from Phenotypic Data
Evolution of Quantitative Traits
Random Genetic Drift and Phenotypic Evolution
Mutation-Selection Balance
Quantitative Trait Loci
Mapping Genes that Influence Quantitative Characters
Significance Testing of QTLs
Composite Interval Mapping and Other Refinements
What Have We Learned from Mapping QTLs?
Summary
Problems







Thanks in part to the power of molecular methods, population genetics has been
reinvigorated. As some genome projects are approaching closure and methods of
"functional genomics" are scaling up to identify the roles of novel genes,
inevitably increasing attention is being paid to the significance of genetic
variation in populations. Nowhere is this more evident than in medical genetics.
Within a decade we can expect that all major single-gene inherited disorders
will be identified, genetically mapped, cloned, and characterized at a fine
molecular level. Health professionals realize that this impressive feat will
have an impact only on a small minority of individuals. Most of the genetic
variation in disease risk is multifactorial, which means that the risk is
determined by multiple genetic and environmental factors acting together. Killer
diseases such as familial forms of cancer, diabetes, and cardiovascular disease
fall into this category. The fact that these diseases aggregate in families
implies that there is probably a genetic component, but the genetic component
may differ from one family or ethnic group to another. Prompted by the high
incidence of multifactorial diseases as a group, the medical community has
become acutely aware of the need to understand the basic structure of genetic
variation in populations in order to determine what aspects of the variation
cause disease.

The exciting practical applications of population genetics to the analysis of
multifactorial diseases have received great attention, but the scope of
population genetics actually is much broader. Population genetics provides the
genetic underpinning for all of evolutionary biology. By "evolution" we mean
descent with modification. Species undergo progressive genetic modification as
they adapt to their environments, and new species arise as a by-product of this
process. The intellectual excitement of biological evolution arises from the
fact that it addresses the fundamental questions, "What are we?" and "Where did
we come from?"

Patterns of evolutionary history are recorded in DNA sequences, and the
application of population genetics to interpreting DNA sequences is revealing
many secrets about the evolutionary past, including the history of our own
species. But population genetics embraces much more than the analysis of
evolutionary relationships. It is particularly concerned with the processes and
mechanisms by which evolutionary changes are made. The field is inherently
multidisciplinary, cutting across molecular biology, genetics, ecology,
evolutionary biology, systematics, natural history, plant breeding, animal
breeding, conservation and wildlife management, human genetics, sociology,
anthropology, mathematics, and

Students taking population genetics are usually expected to have completed, or
to be taking concurrently, a course in differential calculus. While this book
assumes a familiarity with the elementary notation for differentials and
integrals, it does not require great mathematical proficiency. We have kept the
mathematics to a minimum. On the other hand, some of the most important models
in population genetics require quite advanced mathematics. Rather than ignore
these approaches, we have made a concerted effort to present these models in
such a way that the assumptions can be understood and the main results
appreciated without much mathematics. References are provided for the
interested reader to learn more about the details.

Several important changes distinguish the third edition of Principles from the
second edition. The level of the treatment is more tailored to the needs of a
one-semester or one-quarter course, with the intended audience being third- and
fourth-year undergraduates as well as beginning graduate students. Population
genetics is not only an experimental science but also a theoretical one. Special
care has been taken to explain the biological motivation behind the theoretical
models so that the models do not simply materialize out of thin air, and to
explain in plain English the implications of the results. Many concepts are
illustrated by numerical examples, using actual data wherever possible. Special
topics and examples are often set off from the text as boxed problems whose
solutions are explained step by step. Every chapter ends with about 20 problems,
graded in difficulty, and solutions worked in full appear at the end of the text.

This edition of Principles is organized into nine chapters that gradually build
concepts from measuring variation and the various forces that influence genetic
variation through a sequential progression to concepts of molecular population
genetics and quantitative genetics. The first chapter provides a background in
basic genetic and statistical principles. We discuss the fundamental concepts of
allelism, dominance, segregation, recombination, and population frequencies.
The role of model building and testing in population genetics is emphasized.
Chapter 2 introduces the student to the primary data of population genetics,
namely, the many levels of genetic variation. Chapter 3 is concerned with the
organization of genetic variation into genotypes in populations. Here the
Hardy-Weinberg principle gets very thorough coverage, including the cases of
X-linkage and multiple alleles. Chapter 4 widens the perspective and considers
the organization of genetic variation among spatially structured populations.
Population substructure is measured by Wright's F statistics, which are
presented in a way that conveys their biological meaning. The Wahlund principle
and inbreeding are also covered in Chapter 4.

The goal of population genetics is to 
understand the forces that have an impact 
on levels of genetic variation. The forces of 
mutation, recombination, and migration are 
outlined in Chapter 5. Darwinian selection is 
the topic of Chapter 6, including both the 
theoretical foundations and empirical obser- 
vations of the dynamics of gene-frequency 
change under the action of selection. Hap- 
loid and diploid cases are developed, as are 
the concepts of equilibrium, stability, and 
context dependence. After classical models 
of mutation-selection balance are developed, 
a series of more complex scenarios of natural 
selection are presented. 

Chapter 7 deals with random genetic drift. In the absence of other forces,
allele and genotype frequencies change as a result of random sampling from one
generation to another. The Wright-Fisher model and diffusion approximations are
presented in such a way that the student gains an appreciation for the
importance of random genetic drift. The process of the coalescence of
genealogies is an important innovation in theoretical population genetics, and
some of the basic concepts of coalescence are presented in Chapter 7.

In Chapter 8 we cover the rapidly expanding data on molecular evolutionary
genetics. The unifying theme in the study of molecular evolution is Kimura's
neutral theory, and a close examination is made of the correspondence between
the data and theory. This is a field in which advances in our empirical database
and statistical tools for quantifying and manipulating the data are growing at a
dizzying pace. Our goal is to give the student a firm grasp of the fundamentals,
and a deep enough understanding of the principles to identify important gaps in
our knowledge. One intriguing aspect of molecular evolutionary genetics is the
discovery of new phenomena and forces taking place at the molecular level that
go beyond the realm of classical population genetics. Multigene families and
organelle genomes are described in some detail to illustrate these uniquely
molecular phenomena.

Chapter 9 covers the problem of quantitative genetics from an evolutionary
perspective. A compelling argument for using quantitative genetics for the study
of evolution is that adaptive evolution takes place at the level of the
phenotype, and quantitative genetics provides the tools for understanding
transmission of phenotypic traits. Theoretical quantitative genetics is given
special importance by the paradoxes it raises in contrasting evolution at the
levels of the phenotype and of the DNA sequence. Our understanding of the
correspondence between phenotypic and molecular differentiation is very
incomplete, and our understanding of the correspondence between the rates of
morphological and molecular evolution is even less well developed. As in the
preceding chapters, we hope that the student is left with a feeling that there
is plenty of room for imaginative work in this area. Population genetics is a
field with a bright and expanding future.


This book was greatly improved by the efforts of many people. The staff at
Sinauer Associates did a splendid job assisting us with the revision. Nan
Sinauer kept us on track, collecting and assembling dozens of computer files,
revisions, FAXes, phone and email messages. Chris Small oversaw the page layout
and managed the art program. Andy Sinauer played an essential role in having the
book reviewed and in giving helpful advice as to level and length. We are
grateful to Chip Aquadro, James Jacobson, Trudy Mackay, Roger Milkman, Tim
Prout, Glenys Thomson, and Ken Weiss for their comments on the previous edition.
Their insights greatly improved the presentation in this one.

Neither author could participate in writing a book such as this without a
supportive, patient, sympathetic laboratory staff, able and willing to keep
things running smoothly while the boss is at his word-processor doing a neuronal
fusion with a silicon chip. We are grateful to all of them. In Dan Hartl's
laboratory, the list includes Lara Forde, Elena Lozovskaya, Dmitry Nurminsky,
E. Fidelma Boyd, Allan Lohe, Javare Nagaraju, David Sullivan, Charles Hill,
Dmitri Petrov, Mark Siegal, Daniel De Aguiar, Carlos Bustamante, Jeffrey
Townsend, Jorges Vieira, Christina Vieira, Isabel Beerman, Yunsun Nam, Elizabeth
Stover, and Susan Yuknis. In Andy Clark's laboratory the acknowledgments include
Michael Abraham, Joe Canale, Manolis Dermitzakis, Chris Fucito, Cristina
Gonzalez, Jen Ionian, Angela Lambert, Brian Lazzaro, J.P. Masly, Hamish Spencer,
Sarah Tishkoff, Bridget Todd, Carrie Tupper, and Lei Wang.


Genetic and Statistical Background 

Genes • Gene Expression 

Standard Error • Models

Probability • Allele Frequency Estimates 
Population Growth 


The science of population genetics deals with Mendel's laws and other genetic
principles as they affect entire populations of organisms. The organisms may be
human beings, animals, plants, or microbes. The populations may be natural,
agricultural, or experimental. The environment may be city, farm, field, or
forest. The habitat may be soil, water, or air. Because of its wide-ranging
purview, population genetics cuts across many fields of modern biology. A
working knowledge has become essential in genetics, evolutionary biology,
systematics, plant breeding, animal breeding, ecology, natural history,
forestry, horticulture, conservation, and wildlife management. A basic
understanding of population genetics is also useful in medicine, law,
biotechnology, molecular biology, cell biology, sociology, and anthropology.

Population genetics also includes the study of the various forces that 
result in evolutionary changes in species through time. By defining the 
framework within which evolution takes place, the principles of population 
genetics are basic to a broad evolutionary perspective on biology. From an 
experimental point of view, evolution provides a wealth of testable hypothe- 
ses for all other branches of biology. Many oddities in biology become com- 
prehensible in the light of evolution: they result from shared ancestry among 
organisms, and they attest to the unity of life on earth. 

Practical applications of population genetics are extensive. Many applications,
particularly those relevant to human beings, also have important implications in
ethics and social policy. Among the applications of population genetics in
medicine, agriculture, conservation, and research are:

• Genetic counseling of parents and other relatives of patients with heredi- 
tary diseases. 

• Genetic mapping and identification of genes for disease susceptibility in 
human beings, including breast cancer, colon cancer, diabetes, schizo- 
phrenia, and so forth. 

• Implications of population screening for carriers of disease genes, confi- 
dentiality of results, and maintenance of health insurability. 

• Studies of the heritability of IQ score and its implications for affirmative
action, welfare, and other social programs.

• Statistical interpretation of the significance of matching DNA types found
between a suspect and a blood or semen sample from the scene of a crime.

• Design of studies to sample and preserve a record of genetic variation
among human populations throughout the world.

• Improvement in the performance of domesticated animals and crop plants.

• Organization of mating programs for the preservation of endangered
species in zoos and wildlife refuges.

• Sampling and preservation of germ plasms of potentially beneficial
plants and animals that may soon vanish from the wild.

• Interpretation of differences in the nucleotide sequences of genes or
amino acid sequences of proteins among members of the same or closely
related species.

The genetic and statistical principles underlying population genetics are
for the most part simple and straightforward, but it may be helpful to preface
the discussion with a few key definitions and concepts.


Gene is a general term meaning, loosely, the physical entity transmitted
from parent to offspring in reproduction that influences hereditary traits.
Genes influence human traits such as hair color, eye color, skin color, height,
weight, and various aspects of behavior, although most of these traits are
also influenced more or less strongly by environment. Genes also determine
the makeup of proteins such as hemoglobin, which carries oxygen in the red
blood cells, or insulin, which is important in maintaining glucose balance in
the blood. Genes can exist in different forms or states. For example, a gene
for hemoglobin may exist in a normal form or in any one of a number of
alternative forms that result in hemoglobin molecules that are more or less
abnormal. These alternative forms of a gene are called alleles.


From a biochemical point of view, a gene corresponds to a region along a
molecule of DNA (deoxyribonucleic acid). DNA is the genetic material. A
molecule of DNA consists of two strands wound around each other in the
form of a right-handed helix (the celebrated "double helix"). Each strand is a
polymer of constituents called nucleotides, of which there are four,
conventionally symbolized A, T, G, and C according to the nitrogen-rich base that
each contains: either adenine (A), thymine (T), guanine (G), or cytosine (C).
The paired strands are held together by weak chemical bonds (hydrogen
bonds) that form between A and T at corresponding positions in opposite
strands or between G and C at corresponding positions in opposite strands
(Figure 1.1). Wherever one strand contains an A, the other across the way
contains a T; and wherever one strand contains a G, the other across the way
contains a C. Because of the pairing of complementary bases (A with T and
G with C), a double-stranded DNA molecule contains an equal number of A
and T nucleotides as well as an equal number of G and C nucleotides. DNA
molecules can be very long. The DNA molecule in the bacterium E. coli is
about 4.7 million base pairs, that in the largest chromosome in the fruit fly
Drosophila melanogaster is about 65 million base pairs, and that in the largest
human chromosome is about 230 million base pairs. Physical manipulation
of such large molecules is impractical. In order to be studied, they must first
be broken into smaller pieces.
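
The consequence of complementary pairing for base counts can be checked with a small sketch. The code below is only an illustration in Python; the function names and the example sequence are assumptions made for this sketch and do not come from the text. It builds the partner strand position by position and counts each base over both strands of the duplex.

```python
# Pairing rules stated in the text: A pairs with T, and G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def partner_strand(strand):
    """Return the base found across the way from each position of `strand`."""
    return "".join(PAIR[base] for base in strand)


def duplex_base_counts(strand):
    """Count each base over both strands of the double-stranded molecule."""
    both = strand + partner_strand(strand)
    return {base: both.count(base) for base in "ATGC"}


# Arbitrary example sequence (hypothetical, chosen only for illustration).
strand = "ATGCGGATTACA"
counts = duplex_base_counts(strand)
print(partner_strand(strand))
print(counts)
```

Whatever single strand is supplied, the duplex counts of A and T agree, as do those of G and C, because every A on one strand contributes exactly one T on the other and every G contributes exactly one C.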

Gene Expression 

Most genes code for the polypeptide chains that constitute proteins. The
code is the sequence of nucleotides along the DNA. In the decoding of the
nucleotide sequence in DNA and also in the synthesis of proteins, several
types of RNA (ribonucleic acid) are essential. RNA is also a polymer of
nucleotides, each of which carries a base. Three of the bases in RNA (A, C,
and G) are the same as those in DNA. The fourth [uracil (U)] is different.
When an RNA strand pairs with a complementary strand of DNA, U in the
RNA pairs with A in the DNA. Hence, the base-pairing role of U in RNA is
the same as that of T in DNA.

Figure 1.1 Genes are fundamental units of genetic information that
correspond chemically to the sequence of nucleotides in a segment of DNA. A
molecule of duplex DNA is composed of two intertwined strands, each of which
consists of a long sequence of nucleotides. The strands are held together by
pairing between the bases A and T in opposite strands and between the bases G and
C in opposite strands. The short diagonal lines indicate the paired bases. There
are 10 base pairs per turn of the double helix. A typical gene consists of
hundreds or thousands of nucleotides, only a few of which are shown here.

The essentials of gene expression in the cells of higher organisms
(eukaryotes) are outlined in Figure 1.2. The coding regions of the DNA in a
gene, which code for amino acids, are often interrupted by one or more
noncoding regions known as intervening sequences or introns. In the first step in
gene expression (transcription), a molecule of RNA is produced that is
complementary in base sequence to one of the strands of DNA (Figure 1.2A).
Every gene includes a regulatory region (sometimes more than one) that
determines when transcription takes place, the types of cells in which it takes
place, and the strand that is to be transcribed. Because of the base pairing
rules, a DNA sequence (say, 3'-ATCG-5') results in a complementary RNA
sequence (in this example, 5'-UAGC-3'). Note that the DNA and RNA
strands each have a polarity or directionality. The terms 5' and 3' refer to the
polarity of the strands. The 5' end typically terminates with a free phosphate
group and the 3' end typically terminates with a free hydroxyl group (-OH).
When two strands of nucleic acid are paired, the polarity of each strand is
opposite to that of the other. In the duplex DNA in Figure 1.2, for example,
the left-to-right polarity of one strand is 5'-to-3', whereas the left-to-right
polarity of the partner strand is 3'-to-5'. Similarly, in transcription, the
template DNA strand has a left-to-right polarity of 3'-to-5', whereas the RNA
transcript has the left-to-right polarity of 5'-to-3'. Because of the
complementary base pairing between DNA and RNA nucleotides, the base-sequence
code in DNA becomes converted into a base-sequence code in RNA. In
transcription, the base sequence present in the introns is also faithfully copied
into the base sequence of the RNA transcript.

Figure 1.2 Processes in gene expression in eukaryotic cells. (A) DNA regions
coding for the amino acids in a single polypeptide can be interrupted by
noncoding regions (introns). (B) When the DNA is copied into RNA in transcription,
both coding and noncoding regions are transcribed. However, the introns are
removed from the transcript by processing. (C) In the messenger RNA, the
coding regions are contiguous. The messenger RNA is translated to form the chain
of linked amino acids constituting the polypeptide.
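
The base-pairing rule for transcription can be sketched in a few lines of Python. This is a minimal illustration, assuming the template strand is written left to right in the 3'-to-5' direction, as in the example in the text; the function name `transcribe` and the mapping name are hypothetical labels chosen for this sketch.

```python
# DNA base on the template strand -> RNA base in the transcript.
# U in RNA plays the pairing role that T plays in DNA.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5):
    """Return the RNA transcript, written 5'-to-3', for a DNA template
    strand written 3'-to-5' (left to right)."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5)


# The example from the text: template 3'-ATCG-5' gives RNA 5'-UAGC-3'.
print("5'-" + transcribe("ATCG") + "-3'")
```

Because the two paired strands have opposite polarity, reading the template 3'-to-5' yields the transcript already in the conventional 5'-to-3' orientation, so no reversal step is needed under this assumption.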

The second step in gene expression in eukaryotes is RNA processing
(Figure 1.2B). The beginning and end of the RNA transcript are chemically
modified and the introns are removed by splicing (cutting and rejoining). RNA
processing results in a molecule called messenger RNA (mRNA), in which the
coding regions have been made contiguous. The regions in the original RNA
transcript that are retained in the mature mRNA are called exons. The central
part of the mRNA contains the spliced exons that code for the amino acid
sequence of a polypeptide chain. The mRNA also includes exons upstream and
downstream from the protein-coding region. The upstream region is the 5'
untranslated region and the downstream region is the 3' untranslated region.
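
Splicing, as a cutting and rejoining operation, can be pictured with a toy sketch in Python. Everything here is an assumption made for illustration: the sequences are invented, the intron is written in lowercase only to make it visible, the exon positions are supplied by hand, and the untranslated regions and chemical modification of the transcript ends are ignored.

```python
def splice(transcript, exons):
    """Join the exon segments of an RNA transcript.

    `exons` is an ordered list of (start, end) index pairs, with `end`
    exclusive, marking the regions retained in the mature mRNA.
    """
    return "".join(transcript[start:end] for start, end in exons)


# Hypothetical toy transcript: two exons separated by one intron.
exon1 = "AUGUUUCAU"
intron = "guaagucuaacag"
exon2 = "AAACGUUAA"
transcript = exon1 + intron + exon2

# Exon coordinates computed from the segment lengths above.
exons = [(0, len(exon1)), (len(exon1) + len(intron), len(transcript))]

mrna = splice(transcript, exons)
print(mrna)
```

The result is a single contiguous coding sequence in which the intron no longer appears, which is the sense in which the coding regions "have been made contiguous."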

The final step in gene expression is translation, in which the mRNA mol- 
ecule combines with ribosornes and other types of RNA molecules in the 
cytoplasm to produce the final polypeptide (Figure 1.2C). In the coding 
region of the mRNA, each adjacent group of three nucleotides constitutes a 
separate coding group or codon that specifies which amino acid is to be 
incorporated into the polypeptide chain. The ribosome moves along the 
mRNA in steps of three nucleotides (codon by codon). As each new codon 
comes into place, the correct amino acid is brought into line and attached to 
the end of the growing chain of amino acids. New amino acids are added to 
the growing chain until a codon specifying "stop" is encountered. At this 
point synthesis of the chain of amino acids is finished and the polypeptide is 
released from the ribosome.
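The codon-by-codon reading described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the names `CODON_TABLE` and `translate` and the short mRNA string are invented for the example, translation is assumed to begin at the first AUG in the sequence, and the table is the standard genetic code written as one single-letter amino acid per codon ("*" marks a stop).

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, in the order UUU, UUC, UUA, UUG, UCU, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(mrna):
    """Read an mRNA in steps of three nucleotides, from the first AUG to a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing is translated
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # a codon specifying "stop" ends synthesis
            break
        peptide.append(amino_acid)
    return "".join(peptide)


# Hypothetical mRNA: 5' untranslated bases, then AUG CAU CAG GUU UAA, then 3' bases
print(translate("GGAUGCAUCAGGUUUAAGG"))  # MHQV
```

The bases before the AUG and after the UAA are skipped, which loosely mirrors the 5' and 3' untranslated regions.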


In prokaryotes, which include bacteria and other organisms lacking a
nucleus, gene expression is essentially identical to that in eukaryotes except
for the absence of RNA processing. Genes in prokaryotes do not contain
introns, and so splicing is unnecessary. In prokaryotes, the original RNA
transcript is used immediately as mRNA and translated into a polypeptide.
Because there is no separate nucleus, translation in prokaryotes often begins
immediately when the 5' end of an RNA transcript comes off the DNA and
even before transcription of the 3' end of the same molecule has been
completed.

The central role of RNA in gene expression is one of the oddities of biolo- 
gy that makes sense in the light of evolution. That gene expression is config- 
ured around RNA is a legacy of the earliest forms of life when RNA 
molecules served both as carriers of genetic information and as catalytic mol- 
ecules. The role of RNA as carrier of genetic information was gradually 
replaced by DNA, and the role of RNA as catalytic molecules was gradually 
replaced by proteins. At every step along the way, as the RNA world evolved 
into the DNA world, the role of RNA was indispensable in the processes of 
information transfer and protein synthesis, and so the RNA intermediates 
became locked in place. 

The Genetic Code 

The genetic code is the list of all codons showing which amino acid each 
codon specifies. Table 1.1 shows the standard genetic code used in nuclear 
genes in most organisms. A few organisms and some cellular organelles,
such as mitochondria, use slightly altered codes. The codons in Table 1.1 are
those found in the mRNA. The amino acids are given by three-letter abbre- 
viations as well as by conventional single-letter abbreviations. Codon AUG 
is the start codon in polypeptide synthesis; it specifies methionine (Met) at 
the beginning of the polypeptide as well as at internal positions. Three 
codons are stops that result in termination of polypeptide synthesis: UAA, 
UAG, and UGA. The genetic code is redundant in that most amino acids are 
specified by more than one codon. Most of the redundancy is in the third 
codon position. 

A code for an amino acid is twofold degenerate if either of two sequences
specifies the same amino acid. Twofold degenerate codes have the pattern ··Y
or ··R, where ·· stands for the bases in codon positions 1 and 2. The symbol Y
stands for any pyrimidine base (either U or C); the symbol R stands for any
purine base (either A or G). For example, CAU and CAC both code for
histidine (His), fitting the pattern CAY; and CAA and CAG both code for
glutamine (Gln), fitting the pattern CAR. A code for an amino acid is fourfold
degenerate if any of four sequences specifies the same amino acid. Fourfold
degenerate codes have the form ··N, where N means any nucleotide (U, C, A,
or G). For example, GUU, GUC, GUA, and GUG all code for valine (Val),



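Table 1.1  The standard genetic code

First                      Second nucleotide in codon                     Third
nucleotide   U              C              A              G              nucleotide

U            UUU Phe (F)    UCU Ser (S)    UAU Tyr (Y)    UGU Cys (C)    U
             UUC Phe (F)    UCC Ser (S)    UAC Tyr (Y)    UGC Cys (C)    C
             UUA Leu (L)    UCA Ser (S)    UAA Stop       UGA Stop       A
             UUG Leu (L)    UCG Ser (S)    UAG Stop       UGG Trp (W)    G

C            CUU Leu (L)    CCU Pro (P)    CAU His (H)    CGU Arg (R)    U
             CUC Leu (L)    CCC Pro (P)    CAC His (H)    CGC Arg (R)    C
             CUA Leu (L)    CCA Pro (P)    CAA Gln (Q)    CGA Arg (R)    A
             CUG Leu (L)    CCG Pro (P)    CAG Gln (Q)    CGG Arg (R)    G

A            AUU Ile (I)    ACU Thr (T)    AAU Asn (N)    AGU Ser (S)    U
             AUC Ile (I)    ACC Thr (T)    AAC Asn (N)    AGC Ser (S)    C
             AUA Ile (I)    ACA Thr (T)    AAA Lys (K)    AGA Arg (R)    A
             AUG Met (M)    ACG Thr (T)    AAG Lys (K)    AGG Arg (R)    G

G            GUU Val (V)    GCU Ala (A)    GAU Asp (D)    GGU Gly (G)    U
             GUC Val (V)    GCC Ala (A)    GAC Asp (D)    GGC Gly (G)    C
             GUA Val (V)    GCA Ala (A)    GAA Glu (E)    GGA Gly (G)    A
             GUG Val (V)    GCG Ala (A)    GAG Glu (E)    GGG Gly (G)    G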

Note: Codons are nonoverlapping three-base sequences present in mRNA, each of which
specifies an amino acid in a polypeptide chain or terminates synthesis ("Stop"). The full names of
the amino acids are phenylalanine (Phe), leucine (Leu), isoleucine (Ile), methionine (Met),
valine (Val), serine (Ser), proline (Pro), threonine (Thr), alanine (Ala), tyrosine (Tyr), histidine
(His), glutamine (Gln), asparagine (Asn), lysine (Lys), aspartic acid (Asp), glutamic acid (Glu),
cysteine (Cys), tryptophan (Trp), arginine (Arg), and glycine (Gly).

which fits the pattern GUN. Note in Table 1.1 that the code for isoleucine is
threefold degenerate and those for leucine, arginine, and serine are each sixfold
degenerate.
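The degeneracy classes can be checked by counting how many codons specify each amino acid. The following is a minimal sketch rather than anything from the original text: the variable names are invented, and the standard code is entered as a 64-letter string of single-letter amino acid symbols with "*" for stop.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, in the order UUU, UUC, UUA, UUG, UCU, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Number of codons specifying each amino acid (the three stop codons are excluded)
degeneracy = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

# Group the amino acids by their degeneracy
by_class = {}
for amino_acid, n_codons in degeneracy.items():
    by_class.setdefault(n_codons, []).append(amino_acid)

for n_codons in sorted(by_class):
    print(n_codons, "".join(sorted(by_class[n_codons])))
# 1 MW
# 2 CDEFHKNQY
# 3 I
# 4 AGPTV
# 6 LRS
```

Only methionine and tryptophan have a single codon. The sixfold class (L, R, S) matches the leucine, arginine, and serine noted in the text.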

The codons for amino acids are not used randomly in proteins. There are 
preferred codons for amino acids that differ from one gene to the next and 
from one organism to another. Codon preferences exist even within redundancy
classes. In Drosophila, for example, among codons for histidine, CAC is
used more than CAU in a ratio of about 2 : 1. Similarly, among codons for
glutamine, CAG is used more than CAA in a ratio of about 3 : 1. Another
example of nonrandom codon usage is the AUA codon for isoleucine, which
tends to be avoided in most proteins in most organisms. In Drosophila, AUU
and AUC are used more than AUA in a ratio of about 10 : 1. One evolutionary
hypothesis that explains the avoidance of AUA is that, because of the
degeneracy of the genetic code, the AUA codon might sometimes be translated as
AUG, which codes for methionine. Because methionine is likely to change
protein structure radically, the mistranslation would be a costly mistake.
Through evolutionary time, one by one, the AUA codons in a messenger
RNA become replaced with AUU or AUC, minimizing this type of
misincorporation error. This misincorporation hypothesis for AUA codon avoidance
has not been tested, but it is testable.


Alternative alleles of a gene differ in their sequence of nucleotides (Figure
1.3). For example, where one allele has a T-A base pair in the DNA, another
may have a C-G base pair at the same position. Because of redundancy in the
code, not all nucleotide substitutions result in a replacement of one amino
acid for another. In Figure 1.3B, for example, if a mutation at the third
position in the second codon (asterisk) changes one pyrimidine into the other, the
new codon still codes for histidine. On the other hand, some nucleotide
substitutions at the third position do result in amino acid replacements. For
example, in Figure 1.3C, if the third position in the second codon changes
from a pyrimidine to a purine, the codon changes from one for histidine to
one for glutamine. Most nucleotide substitutions at codon positions one and
two result in amino acid replacements (Figure 1.3D).
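The contrast between the codon positions can be made concrete by tabulating every single-nucleotide change in the standard code. The sketch below is an illustration only. The function name `synonymous_fraction` is invented, positions are numbered 0, 1, 2 as is usual in Python, and changes to or from a stop codon are left out of the tally by assumption.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, in the order UUU, UUC, UUA, UUG, UCU, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def synonymous_fraction(position):
    """Fraction of single-base changes at one codon position (0, 1, or 2)
    that leave the amino acid unchanged, counting sense codons only."""
    unchanged = total = 0
    for codon, amino_acid in CODON_TABLE.items():
        if amino_acid == "*":
            continue
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            if CODON_TABLE[mutant] == "*":
                continue
            total += 1
            unchanged += CODON_TABLE[mutant] == amino_acid
    return unchanged / total


# The histidine example from the text: CAU -> CAC is silent, CAU -> CAA is not
print(CODON_TABLE["CAU"], CODON_TABLE["CAC"], CODON_TABLE["CAA"])  # H H Q

for position in range(3):
    print(position + 1, round(synonymous_fraction(position), 3))
```

Under these assumptions the third position has by far the largest share of silent changes, the first position has a few (in leucine and arginine codons), and the second position has none.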

Not all alleles differ by a mere nucleotide substitution. Relative to the typ- 
ical or wildtype allele, some alleles may have a deletion of a number of 
nucleotide pairs or an insertion into the DNA molecule. The number of 
nucleotides deleted or inserted may be small (as few as one nucleotide pair) 
or large. Some insertions are thousands of nucleotide pairs in size. Many 
large insertions result from the activity of transposable elements, which are 
specialized sequences of DNA able to replicate and insert at novel positions 
virtually anywhere in the DNA of the organism in which they are present.
Alleles also may differ in the number of copies of short sequences present in
tandem arrays in the DNA. For example, near many genes in human beings
are tandem copies of dinucleotides, such as 5'-CACACACA . . . -3'. Such a
repeating sequence is symbolized as (5'-CA-3')n. The number of copies (n) of
the dinucleotide repeat often ranges from fewer than ten to hundreds, and the
number of copies may differ dramatically from one allele to the next. Some 
alleles even differ from wildtype in having an inversion of the nucleotide 
sequence in a region of DNA. 

Genotype and Phenotype 

Within a living cell, genes are arranged in linear order along microscopic 
threadlike bodies called chromosomes. A typical chromosome may contain 

Figure 1.3 Alleles are alternative forms of a gene. (A) The arrows show how
the genetic information in a portion of the nucleotide sequence of DNA specifies
the amino acid sequence in a portion of a polypeptide. Each group of three
adjacent nucleotides corresponds to one amino acid in the polypeptide. (B, C, D)
Substitution of one nucleotide for another in the DNA (indicated by the
asterisks and heavy lines) can result in the replacement of one amino acid for
another in the polypeptide.

several thousand genes. The position of a gene along a chromosome is called
the locus of the gene. In most higher organisms, each cell contains two copies
of each type of chromosome. Such organisms, in which the chromosomes are
present in pairs, are said to be diploid. In each pair of chromosomes, one
member is inherited from the mother through the egg and the other is
inherited from the father through the sperm. At every locus, therefore, diploid
organisms contain two alleles, one each at corresponding positions in the
maternal and paternal chromosomes. If the two alleles at a locus are
chemically identical (in the sense of having the same nucleotide sequence
along the DNA), the organism is said to be homozygous at the locus under
consideration; if the two alleles at a locus are chemically different, the
organism is said to be heterozygous at the locus. The term gene is a general term
usually used in the sense of locus.

Geneticists make a fundamental distinction between the genetic constitu- 
tion of an organism and the physical or biochemical attributes of the organ- 
ism. The genetic constitution of an organism is called the genotype; genotype 
thus refers to the particular alleles present in an organism at all loci that affect 
the trait in question. For example, if a trait is influenced by two genes, each 
with two alleles, then there are nine possible genotypes, as follows:

AA;BB   AA;Bb   AA;bb

Aa;BB   Aa;Bb   Aa;bb

aa;BB   aa;Bb   aa;bb

where A and a refer to the alleles of the first gene and B and b refer to the
alleles of the second gene. When the genes are linked (located in
the same chromosome), it is sometimes necessary to distinguish between the
genotypes AB/ab and Ab/aB, in which case there are ten possible genotypes.
In contrast to genotype, the physical expression of a genotype is called the
phenotype. Examples of phenotypes include hair color, eye color, height,
weight, number of kernels on an ear of corn, number of eggs laid by a hen,
and round versus wrinkled pea seeds. The distinction between the genetic
constitution of an organism (genotype) and the physical or biochemical
attributes of the organism (phenotype) is particularly important in cases in
which the environment can affect the trait; in such cases, two organisms with
the same genotype can nevertheless have different phenotypes because of
differences in the environment. Conversely, two organisms with the same
phenotype can have different genotypes.
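The counts of nine and ten two-gene genotypes can be verified by enumeration. In this sketch (the variable names are illustrative, not from the text), the unlinked case pairs the three genotypes of each gene, whereas the linked case treats a genotype as an unordered pair of chromosomes, each carrying one allele of each gene.

```python
from itertools import combinations_with_replacement, product

# Without regard to linkage: one genotype per pair of single-gene genotypes
unphased = [f"{a};{b}" for a, b in product(["AA", "Aa", "aa"], ["BB", "Bb", "bb"])]

# With linkage: a genotype is an unordered pair of chromosomes (A or a with B or b)
chromosomes = ["AB", "Ab", "aB", "ab"]
phased = [f"{c1}/{c2}" for c1, c2 in combinations_with_replacement(chromosomes, 2)]

print(len(unphased))  # 9
print(len(phased))    # 10
print([g for g in phased if g in ("AB/ab", "Ab/aB")])  # the two forms of Aa;Bb
```

The extra genotype arises because the double heterozygote Aa;Bb corresponds to two distinct chromosome pairs, AB/ab and Ab/aB.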

PROBLEM 1.1 If a gene in a diploid organism has m alternative alleles,
show that the number of possible genotypes equals m(m + 1)/2.

ANSWER: Consider first the heterozygotes. There are m ways of
choosing the first allele and, having done that, there are m - 1 ways of
choosing a different second allele. Altogether, there are m(m - 1)/2
different heterozygotes. The division by 2 is necessary because, for
each heterozygote (say, AiAj), it makes no difference whether Ai was
chosen first and Aj second or the other way around. In addition to
the heterozygotes, there are m possible homozygotes. Hence, the total
number of diploid genotypes equals [m(m - 1)/2] + m = m(m + 1)/2.
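The formula in Problem 1.1 can also be confirmed by brute force. The short sketch below, in which the function name and the allele labels A1, A2, ... are invented for illustration, lists every unordered pair of alleles and compares the counts with m(m - 1)/2 heterozygotes and m(m + 1)/2 genotypes in total.

```python
from itertools import combinations_with_replacement


def diploid_genotypes(m):
    """All unordered pairs of alleles that can be formed from m alleles."""
    alleles = [f"A{i}" for i in range(1, m + 1)]
    return list(combinations_with_replacement(alleles, 2))


for m in range(1, 7):
    genotypes = diploid_genotypes(m)
    heterozygotes = [g for g in genotypes if g[0] != g[1]]
    homozygotes = [g for g in genotypes if g[0] == g[1]]
    assert len(heterozygotes) == m * (m - 1) // 2
    assert len(homozygotes) == m
    assert len(genotypes) == m * (m + 1) // 2
    print(m, len(genotypes))
```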

Dominance and Gene Interaction 

Whether each genotype has a single, unique expression of the trait depends 
on the manner in which the alleles of a gene interact in development. For the
alleles of one gene, dominance refers to the concealment of the presence of 
one allele by the strong phenotypic effects of another. For example, with two 
alleles there are three possible genotypes: 

AA Aa aa 

Several types of dominance are distinguished, as in the following
examples:

• Complete dominance: A is completely dominant to a if the phenotypes of 
AA and Aa cannot be distinguished. 

• Incomplete dominance: A shows incomplete dominance with respect to a if
the phenotype of Aa is intermediate between that of AA and that of aa. 
This situation is also referred to as partial dominance or intermediate 
dominance. When the phenotype can be measured on a quantitative 
scale, for example, the number of kernels on an ear of corn, and the phe- 
notype of Aa is exactly the average between that of AA and that of aa, 
then the alleles are said to be additive alleles and the type of dominance 
is sometimes called semidominance. 

• Codominance: A and a are codominant if the products of both alleles can
be detected in Aa heterozygotes. Many alleles are codominant at the
level of their protein products because two different forms of the
polypeptide, encoded by A and a, can be detected in heterozygotes. At
the level of the DNA sequences, all alleles differing in DNA sequence are
codominant.



It is important to note that dominance is not a characteristic of alleles so
much as a characteristic of the manner in which the phenotype is examined.
An allele may show complete dominance if the phenotype is examined in
one way, incomplete dominance if examined in another, and codominance if
examined in still another. For example, the allele for round pea seeds W studied by
Gregor Mendel is completely dominant to that for wrinkled seeds w when
the phenotype "round" versus "wrinkled" is examined. The genetic defect in
wrinkled seeds is the absence of an enzyme needed for the synthesis of a
branched-chain form of starch. Microscopic examination reveals subtle
differences in the form of the starch grains in seeds of the three genotypes: WW
seeds contain large, well-rounded starch grains, retain water, and shrink
uniformly as they ripen, so the seeds do not become wrinkled; ww seeds lack the
branched-chain starch and are irregular in shape because the ripening seeds
lose water more rapidly and shrink unevenly. However, heterozygous Ww
seeds have starch grains that are intermediate in shape even though the seeds
shrink uniformly and show no wrinkling. Therefore, at the level of the starch
grains, there is incomplete dominance of W and w because the starch grains
in the heterozygotes are intermediate between the two homozygotes.
Furthermore, the difference in DNA sequence between W and w can readily be
detected with modern methods, so that W and w are codominant at the level
of DNA sequence.

For traits affected by more than one gene, the relation between geno- 
type and phenotype depends not only on the degree of dominance of the 
alleles of each gene but also on the type of interaction between the genes in 
development. For example, suppose that the trait in question is degree of 
pigmentation and that pigmentation is determined by two alleles of each 
of two genes, say, A, a and B, b. Suppose further that the total amount of
pigment in an organism results from the total number of A and B alleles
present, each of which adds a single unit of pigmentation to the phenotype.
Then, as shown in Table 1.2, there are only five possible levels of
pigmentation (0 through 4) and genotypes aa BB, Aa Bb, and AA bb all have the
same phenotype. Because each uppercase allele adds the same quantity to
the total phenotype, the type of gene interaction in Table 1.2 is said to be
additive.
Segregation and Recombination 

The essential mechanism of inheritance was established by Gregor Mendel 
(1822-1884) in experiments with garden peas carried out in the years 1856 to 
1863 in a small garden plot next to the monastery in which he lived. Mendel 
showed that the alleles of each gene segregate from one another in the for- 
mation of reproductive cells or gametes. Because of segregation, heterozy- 
gous genotypes form equal numbers of gametes containing each allele. 

Table 1.2

Genotype (a)              Amount of pigmentation (b)

AA BB                     4
AA Bb, Aa BB              3
AA bb, Aa Bb, aa BB       2
Aa bb, aa Bb              1
aa bb                     0

(a) At left are shown the nine possible genotypes of two genes with two alleles of each gene. At
right is shown the amount of pigmentation expected in each genotype when it is assumed
that each allele designated by an uppercase letter is responsible for producing a certain
amount of pigment.
(b) Measured as an increase in pigmentation over that in aa bb genotypes.

Furthermore, because gametes unite at random in fertilization, the following 
are the results of simple Mendelian segregation: 

• AA x AA matings produce all AA progeny.

• AA x Aa matings produce 1/2 AA and 1/2 Aa progeny.

• AA x aa matings produce all Aa progeny.

• Aa x Aa matings produce 1/4 AA, 1/2 Aa, and 1/4 aa progeny.

• Aa x aa matings produce 1/2 Aa and 1/2 aa progeny.

• aa x aa matings produce all aa progeny.
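The six results listed above follow from two assumptions: each parent transmits either of its two alleles with probability 1/2, and gametes unite at random. The sketch below encodes exactly that. The function name `cross` and the two-letter genotype strings are conventions chosen for the example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(parent1, parent2):
    """Progeny genotype proportions for a mating between two diploid genotypes."""
    progeny = Counter()
    # Each of the 2 x 2 gamete combinations has probability 1/2 x 1/2 = 1/4
    for gamete1, gamete2 in product(parent1, parent2):
        genotype = "".join(sorted(gamete1 + gamete2))  # write "aA" as "Aa"
        progeny[genotype] += Fraction(1, 4)
    return dict(progeny)


matings = [("AA", "AA"), ("AA", "Aa"), ("AA", "aa"),
           ("Aa", "Aa"), ("Aa", "aa"), ("aa", "aa")]
for p1, p2 in matings:
    proportions = {g: str(f) for g, f in cross(p1, p2).items()}
    print(f"{p1} x {p2}: {proportions}")
```

Exact fractions are used so that the proportions print as 1/4 and 1/2 rather than as decimals.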

The physical basis of Mendelian segregation is that the maternal and 
paternal pairs of chromosomes are separated into different cells in the
formation of gametes. Prior to their separation, the maternal and paternal
chromosomes associate intimately all along their length and alleles may be
interchanged in the process of recombination (Figure 1.4). The interchange of
parts takes place after the chromosomes have replicated, and only two of the
four chromosome strands participate in any one exchange. Recombination
results in the creation of allele combinations different from either parental
chromosome. In Figure 1.4, the A b and a B combinations are recombinant,
whereas the A B and a b combinations are parental (nonrecombinant).
Therefore, a single exchange between parental chromosomes results in two
recombinant and two nonrecombinant gametes.

In organisms with an XX-XY chromosomal mechanism of sex
determination, Mendelian segregation randomizes the sex ratio at fertilization. In
mammals and many other animals, sex is determined by sex chromosomes: males
have an X and a Y chromosome, and females have two X chromosomes. In
males, the X and Y chromosomes segregate, yielding equal proportions of
Figure 1.4 Recombination results from a physical interchange of parts
between chromosomes. New combinations of alleles are created that differ from
either parental chromosome. The physical interchange of parts takes place in
gamete formation after the chromosomes have replicated, and only two of the
four chromosome strands participate in any one exchange.

X-bearing and Y-bearing sperm. If both types of sperm are equally able to
fertilize eggs, then random union of sperm with eggs yields 1/2 XX (female) and
1/2 XY (male) chromosome constitutions.


The basic concepts of probability needed for elementary population genetics
are quite straightforward. They will be introduced with the concrete example
of genetic segregation in Figure 1.5, which deals with the progeny of the mating
(A) Addition rule

    Mating:     Aa x Aa
    Offspring:  1/4 AA + 1/2 Aa + 1/4 aa
    A- means "Offspring either AA or Aa"

(B) Multiplication rule

    Birth order           Probability
    1st   2nd   3rd
    A-    A-    A-        3/4 x 3/4 x 3/4 = 27/64
    A-    A-    aa        3/4 x 3/4 x 1/4 =  9/64
    A-    aa    A-        3/4 x 1/4 x 3/4 =  9/64
    aa    A-    A-        1/4 x 3/4 x 3/4 =  9/64
    A-    aa    aa        3/4 x 1/4 x 1/4 =  3/64
    aa    A-    aa        1/4 x 3/4 x 1/4 =  3/64
    aa    aa    A-        1/4 x 1/4 x 3/4 =  3/64
    aa    aa    aa        1/4 x 1/4 x 1/4 =  1/64

Figure 1.5 Basic concepts of probability illustrated by Mendelian segregation
in the mating Aa x Aa. The elementary outcomes of the mating are the possible
genotypes of each progeny (AA, Aa, and aa), and these are realized with
probabilities 1/4, 1/2, and 1/4, respectively. (A) The compound event A- consists of the
two elementary outcomes AA and Aa, and the probability of A- is the sum of
the probabilities of these elementary outcomes (addition rule). (B) The possible
distributions of genotypes A- and aa in sibships of size three offspring.
Successive births are independent, and so the probability of any sibship equals the
product of the probabilities for each birth separately (multiplication rule).


Aa x Aa. Considerations in probability always begin with an experiment of
some kind. The experiment may be either a real experiment or a conceptual
experiment. In Figure 1.5, it is a conceptual experiment in which Aa is crossed
with Aa. In probability calculations, it is also necessary to define all possible
outcomes of the experiment. The outcomes are called elementary outcomes
because they are defined in such a way that, in any repetition of the
experiment, one and only one of the elementary outcomes must be realized. For
example, if we are interested in the genotypes among the progeny of the
mating Aa x Aa, the possible elementary outcomes for each offspring are AA, Aa,
or aa. (Note that, in defining these as the elementary outcomes, we are
ignoring the possibility of either A or a mutating to a novel allele.) To proceed
further, we must assign to each elementary outcome a probability, a number
between 0 and 1 that measures how much confidence we have that the
outcome will be realized. The probabilities assigned to the outcomes are based on
genetic reasoning, intuition, or experience. One requirement of the assigned
probabilities is that the probabilities of all the elementary outcomes must add
to 1; this is the mathematical consequence of requiring that one of the
elementary outcomes must be realized. For example, if there are three elementary
outcomes, and all are equally probable, then each has a probability of 1/3. In Figure
1.5, the probabilities assigned to the elementary outcomes AA, Aa, and aa are
1/4, 1/2, and 1/4, respectively, because these are the relative proportions of the
three progeny genotypes expected from Mendelian segregation.

The Addition Rule 

An outcome of a conceptual experiment is an event. The distinction between
an event and an elementary outcome is that an event can include more than
one elementary outcome. For example, in Figure 1.5A, the event "the
offspring has at least one copy of the dominant A allele" consists of two
elementary outcomes, namely, genotypes AA and Aa. This event may be
symbolized A-, where the dash indicates that the unspecified allele may be either
A or a. For events defined in terms of elementary outcomes, the probability
of an event equals the sum of the probabilities of the elementary outcomes
included in the event. In the present example,

Pr(A-) = Pr(AA) + Pr(Aa) = 1/4 + 1/2 = 3/4

More generally, two events are mutually exclusive if they cannot be
realized simultaneously. The addition rule states that, for mutually exclusive
events, the probability that either one or the other is realized equals the sum
of the probabilities of the separate events.

The Multiplication Rule 

Figure 1.5B shows all possible genotypes of sibships of three offspring from
the mating Aa x Aa, with each offspring classified as A- versus aa. (A sibship
is a group of brothers and sisters.) The probability of A- in any particular
birth is 3/4 and that of aa is 1/4. The probabilities at the right are the overall
probabilities for each of the sibships. They are obtained by multiplication of
the probability for each birth because successive births are independent,
which means that the genotype of any birth has no effect on the genotype of
any other birth. Because of the independence, among the 3/4 of the sibships
with A- in the first birth, 3/4 will have A- in the second birth, and among
the 3/4 x 3/4 of the sibships with A- in the first two births, 3/4 will have
A- in the third birth. Therefore, the overall probability of three A- births
is 3/4 x 3/4 x 3/4. The reasoning for the other types of sibships is similar. More
generally, the multiplication rule states that, whenever two events are
independent, the probability of their joint realization is the product of the
probabilities of their being realized separately.

Repeated Trials 

The sibships in Figure 1.5B are an example of repeated trials of a
conceptual experiment. Repeated trials are encountered frequently in probability.
They govern tosses of a coin or dice, deals of cards, successive spins of a
roulette wheel, and so forth. Repeated trials are also important in population
genetics because successive offspring of a mating are independent events
and thus repeated trials. Furthermore, it is apparent from Figure 1.5B that the
different birth orders are mutually exclusive: any sibship can have one and
only one birth order of A- or aa. Because the birth orders are mutually
exclusive, their probabilities may be combined by the addition rule. Hence, the
composite events below have the following probabilities:

Pr(two A- and one aa) = 9/64 + 9/64 + 9/64 = 27/64
Pr(one A- and two aa) = 3/64 + 3/64 + 3/64 = 9/64
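Both rules can be applied mechanically by listing the eight birth orders. In the sketch below (all names are illustrative), the probability of each birth order is a product of per-birth probabilities, which is the multiplication rule, and the probabilities of birth orders with the same number of A- offspring are then summed, which is the addition rule.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

PROB = {"A-": Fraction(3, 4), "aa": Fraction(1, 4)}  # per-birth probabilities

by_order = {}
by_count = defaultdict(Fraction)
for births in product(PROB, repeat=3):
    # Multiplication rule: successive births are independent
    probability = PROB[births[0]] * PROB[births[1]] * PROB[births[2]]
    by_order[births] = probability
    # Addition rule: different birth orders are mutually exclusive
    by_count[births.count("A-")] += probability

for n_dominant in (3, 2, 1, 0):
    print(n_dominant, "A-:", by_count[n_dominant])
# 3 A-: 27/64
# 2 A-: 27/64
# 1 A-: 9/64
# 0 A-: 1/64
```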

Note in Figure 1.5B that, when the sibships with the same number of A-
and aa genotypes are combined, the overall probabilities are given by
successive terms in the expansion of

$$
\begin{aligned}
\left(\tfrac{3}{4}\,A\text{-} + \tfrac{1}{4}\,aa\right)^{3}
  &= 1 \times \left(\tfrac{3}{4}\right)^{3}
     && (A\text{-}\ \ A\text{-}\ \ A\text{-}) \\
  &\quad + 3 \times \left(\tfrac{3}{4}\right)^{2}\left(\tfrac{1}{4}\right)^{1}
     && (A\text{-}\ \ A\text{-}\ \ aa) \\
  &\quad + 3 \times \left(\tfrac{3}{4}\right)^{1}\left(\tfrac{1}{4}\right)^{2}
     && (A\text{-}\ \ aa\ \ aa) \\
  &\quad + 1 \times \left(\tfrac{1}{4}\right)^{3}
     && (aa\ \ aa\ \ aa)
\end{aligned}
$$

The coefficients 1 : 3 : 3 : 1 are the number of orders in which each
triad of genotypes can be born: 1 for A- A- A-, 3 for A- A- aa (because the aa
genotype can be born either first, second, or third), and so forth. Each product of
powers of 3/4 and 1/4 is the probability that any one of the birth orders will be realized;
for example, $(3/4)^2 (1/4)^1$ is the probability that any one birth order with two A- and one
aa genotype will be realized.

In all cases of repeated and independent trials, the overall probabilities
are given by analogous expansions. Suppose that any one trial may result in
either of two mutually exclusive events, A or B, and that the probability of
event A is p and that of event B is q (with p + q = 1). Among a total of n
independent trials, what is the probability that A is realized exactly r times and B
is realized exactly n - r times? By the multiplication rule, any particular
combination of r As and n - r Bs has a probability $p^r q^{n-r}$. Deducing the total
number of combinations of r As and n - r Bs is a little less obvious, but it is given
by the coefficient of the term $p^r q^{n-r}$ in the expansion of $(p + q)^n$, which equals

$$
\frac{n!}{r!\,(n-r)!} \tag{1.1}
$$

where the exclamation point means the factorial, the product of all integers
from 1 through the number in question. For example, n! = 1 x 2 x 3 x ... x n.
For consistency, the number 0! is defined as 0! = 1.

Equation 1.1 is often called a binomial coefficient because it arises in
the expansion of the two terms $(p + q)^n$. To understand the reason why
Equation 1.1 yields the correct number of combinations of r As and (n - r)
Bs, first consider what n! means. It is the total number of ways that any
set of n objects can be arranged in order. There are n ways to choose the first
object and, having chosen the first, n - 1 ways to choose the second and,
having chosen the first two, n - 2 ways to choose the third, and so on,
yielding n x (n - 1) x (n - 2) x ... x 1 = n!. Furthermore, for each arrangement of
n objects of which r are As and (n - r) are Bs, there are r! ways to arrange the
As among themselves and (n - r)! ways to arrange the Bs among themselves,
for a total of r! x (n - r)! arrangements. Because each of the n! orderings
of r As and (n - r) Bs is one of r! x (n - r)! equivalent orderings of the As
and Bs, the total number of different arrangements of r As and (n - r) Bs
equals the ratio given in Equation 1.1.
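The counting argument can be checked directly for small n. In this sketch (the function names are invented for the example), all n! orderings of r As and n - r Bs are generated, orderings that look alike are collapsed into one, and the number remaining is compared with n!/[r!(n - r)!].

```python
from itertools import permutations
from math import factorial


def count_arrangements(n, r):
    """Number of distinguishable orderings of r As and n - r Bs, by brute force."""
    letters = "A" * r + "B" * (n - r)
    # permutations() yields all n! orderings; the set keeps each distinct one once
    return len(set(permutations(letters)))


def binomial_coefficient(n, r):
    """Equation 1.1: n! / [r! (n - r)!]."""
    return factorial(n) // (factorial(r) * factorial(n - r))


for n in range(1, 8):
    for r in range(n + 1):
        assert count_arrangements(n, r) == binomial_coefficient(n, r)

print(count_arrangements(3, 2), binomial_coefficient(3, 2))  # 3 3
```

The brute-force count is practical only for small n, since the number of orderings grows as n!. The formula has no such limit.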

Equation 1.1 gives the number of different arrangements of r As and (n - r)
Bs. Each arrangement has a probability given by $p^r q^{n-r}$. Therefore, using the
addition rule, the probability that n repeated trials yield r realizations of A
and (n - r) realizations of B equals

$$
\frac{n!}{r!\,(n-r)!}\,p^{r} q^{n-r} \tag{1.2}
$$

As an example of the use of Equation 1.2, consider the probability that a
sibship of 12 offspring from the mating Aa x Aa perfectly matches the
expected Mendelian ratio of 9 A- and 3 aa. In this case, p = 3/4, q = 1/4, n = 12,
r = 9, and n - r = 3. The required probability from Equation 1.2 is therefore

$$
\frac{12!}{9!\,3!}\left(\frac{3}{4}\right)^{9}\left(\frac{1}{4}\right)^{3}
  = 220 \times 0.0751 \times 0.0156 = 0.258
$$

The implication of this calculation is that, whereas the "expected" ratio is
9 A- : 3 aa, only a little more than 25% of such sibships actually have the
expected distribution.
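The same number can be reproduced numerically. The sketch below assumes Python 3.8 or later for `math.comb`, and the function name `binomial_probability` is invented for the example. It evaluates Equation 1.2 for the sibship of 12 and then tabulates the nearby outcomes to show how the probability is spread around the expected ratio.

```python
from math import comb


def binomial_probability(n, r, p):
    """Equation 1.2: probability of exactly r realizations of A in n trials."""
    q = 1 - p
    return comb(n, r) * p**r * q**(n - r)


prob = binomial_probability(12, 9, 0.75)
print(comb(12, 9), round(prob, 3))  # 220 0.258

# Probabilities of the neighboring outcomes in sibships of 12
for r in range(6, 13):
    print(r, "A- :", 12 - r, "aa", round(binomial_probability(12, r, 0.75), 3))
```

The exactly expected 9 : 3 outcome is the single most probable one, yet nearly three-quarters of sibships deviate from it.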

PROBLEM 1.2 Suppose that a society decided to limit the number
of males by denying further reproduction to any
woman who gives birth to a male child. Given a ratio of males to
females at birth of 1 : 1, how would such a law affect the sex ratio?
Suppose further that, in practice, any woman who has a female child
voluntarily terminates further reproduction with probability p. In this
case, what is the proportion of males in sibships of size n?

ANSWER: The law would have no effect on the sex ratio. To understand
why, consider the first birth across the entire population. The
sex ratio among these offspring must be 50% males. Consider now the
second birth. The sex ratio among these offspring must also be 50%
males. In fact, the sex ratio in any birth must be 50% males, and so
this is the sex ratio among births as a whole. In regard to
the second part of the problem, note that sibships of size n can be
separated into two classes: those in which the final birth is a male (and
further reproduction is denied) and those in which the
final birth is a girl (in which the mother voluntarily stops reproducing
with probability p). These types of sibships occur in the ratio 1/2 :
p/2, which leads to the proportions 1/(1 + p) and p/(1 + p),
respectively. The first type of sibship has a proportion of males of 1/n and
the second has a proportion of males of 0. Hence, the proportion of
males as a function of sibship size equals (1/n) x [1/(1 + p)] + 0 x
[p/(1 + p)] = 1/[n(1 + p)]. Note that, for p = 0, the proportion of males as
a function of sibship size decreases according to the series 1, 1/2, 1/3,
1/4, .... Nevertheless, the sex ratio in the population as a whole equals
1/2 for this and any other value of p.
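The answer to this problem is easy to doubt, so a simulation may help. The following is a rough Monte Carlo sketch under the assumptions stated in the problem (male and female births equally likely, reproduction ending after the first male, and voluntary stopping after a female birth with probability p). The function name, the number of families, and the random seed are arbitrary choices made for the example.

```python
import random
from collections import defaultdict


def simulate(p_stop, n_families=200_000, seed=1):
    """Simulate completed sibships; return male and total counts by sibship size."""
    rng = random.Random(seed)
    males_by_size = defaultdict(int)
    children_by_size = defaultdict(int)
    for _ in range(n_families):
        size = males = 0
        while True:
            size += 1
            if rng.random() < 0.5:     # male birth: further reproduction is denied
                males += 1
                break
            if rng.random() < p_stop:  # female birth: mother stops voluntarily
                break
        males_by_size[size] += males
        children_by_size[size] += size
    return males_by_size, children_by_size


p = 0.5
males, children = simulate(p)
print("overall proportion of males:",
      round(sum(males.values()) / sum(children.values()), 3))
for n in (1, 2, 3):
    observed = males[n] / children[n]
    expected = 1 / (n * (1 + p))
    print(n, round(observed, 3), round(expected, 3))
```

Within sampling error, the overall proportion of males stays near 1/2, while the proportion within sibships of size n tracks 1/[n(1 + p)].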



One of the universal attributes of natural populations is that organisms
differ in phenotype with respect to many traits. Phenotypic diversity in many
traits is impressive even with the most casual observation. Among human
beings, for example, there is diversity with respect to height, weight, body
conformation, hair color and texture, skin color, eye color, and many other
physical and psychological attributes or skills. Population genetics must deal
with this phenotypic diversity, and especially with that portion of the
diversity that is caused by differences in genotype. In particular, the field of
population genetics has set for itself the tasks of determining how much genetic
variation exists in natural populations and of explaining its origin,
maintenance, and evolutionary importance. Genetic variation, in the form of
multiple alleles of many genes, exists in most natural populations. In most
sexually reproducing populations, no two organisms (barring identical twins or
other multiple identical births) can be expected to have the same genotype 
for all genes. Thus, it becomes important to describe how alleles in natu- 
ral populations are organized into genotypes — to determine, for example, 
whether alleles of the same or different genes are associated at random. 

Allele Frequencies in Populations

Much of the phenotypic variation in natural populations does not yield
simple Mendelian segregation ratios such as 1 : 1 or 3 : 1 in pedigrees. Some
differences in phenotype are environmental in origin and so are not expected to
show Mendelian segregation. However, simple Mendelian segregation is not
usually observed even for traits whose expression is influenced more or less
strongly by genetic factors. Although the underlying genetic factors do
segregate in pedigrees in Mendelian fashion, the segregation is concealed by
several complications. First, environmental effects on the trait may be strong
enough to mask the genetic segregation. Second, genetic effects on many
traits are determined by the joint effects of the alleles of two or more genes,
and the segregation of any one gene in a pedigree may be obscured by the
segregation of others.

On the other hand, some phenotypic diversity in populations does show
simple Mendelian segregation. In the snapdragon Antirrhinum majus, for
example, whether the flower color is red, pink, or white is determined by the
alleles I and i of a single gene. The genotypes II, Ii, and ii have red, pink, and
white flowers, respectively, an example of incomplete dominance.

Populations containing both the I and i alleles will include plants whose
flowers are red (II), pink (Ii), or white (ii) in proportions determined by the
allele frequencies of the I and i alleles in the population as well as by the
manner in which the alleles are united in fertilization. By the allele
frequency of a specified allele, we mean the proportion of all alleles of the
gene that are of the specified type. To take a hypothetical example, suppose 400

Genetic and Statistical Background 21 

members of a population were classified as to flower color and the finding
was: 165 red, 190 pink, and 45 white. Because the flower color reveals the
genotype, we may infer that the sample of 400 includes 165 II, 190 Ii, and 45 ii
genotypes. The observed numbers of I and i alleles are therefore:

I: 2 × 165 + 190 = 520

i: 190 + 2 × 45 = 280

The factors of 2 are included for the homozygous genotypes because each
II genotype contains two I alleles and each ii genotype contains two i alleles.
The total number of alleles in the sample equals 2 × 400 = 800. Therefore, if
we let p represent the frequency of the I allele and q represent the frequency
of the i allele (with p + q = 1 because these are the only alleles of the gene in
question), then we can estimate p and q from the observations as:

$\hat{p} = 520/800 = 0.65$

$\hat{q} = 280/800 = 0.35$
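
The allele-counting arithmetic above can be sketched in a few lines of Python. This is only an illustration under the assumptions of the text (one gene, two alleles, and a genotype that can be read directly from the phenotype); the function name `allele_frequencies` and its argument names are ours and do not come from the original.

```python
def allele_frequencies(n_II, n_Ii, n_ii):
    """Estimate the frequencies of alleles I and i from genotype counts."""
    total_alleles = 2 * (n_II + n_Ii + n_ii)  # diploid: two alleles per organism
    count_I = 2 * n_II + n_Ii                 # each II carries two I, each Ii one
    count_i = n_Ii + 2 * n_ii                 # each ii carries two i, each Ii one
    return count_I / total_alleles, count_i / total_alleles


# Hypothetical snapdragon sample from the text: 165 red, 190 pink, 45 white.
p_hat, q_hat = allele_frequencies(165, 190, 45)
print(p_hat, q_hat)  # 0.65 0.35
```

Because every allele is either I or i, the two estimates always sum to 1, which is a convenient check on the counting.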

Note that, if the I and i alleles were combined into genotypes at random,
the expected frequencies of the three genotypes can be calculated from the rule
for repeated trials by expanding the binomial
$(pI + qi)^2 = p^2\,II + 2pq\,Ii + q^2\,ii$.
Therefore, assuming random combination into genotypes, the expected
numbers of the three genotypes are:

II: (0.65)^2 × 400 = 169

Ii: 2 × 0.65 × 0.35 × 400 = 182

ii: (0.35)^2 × 400 = 49

Hence, the observed numbers in this hypothetical population are very
close to those expected with random combinations of alleles. The proportions
$p^2$, $2pq$, and $q^2$ for the three genotypes when two alleles are combined at
random constitute the Hardy-Weinberg principle, which is one of the basic
principles in population genetics. The Hardy-Weinberg principle is discussed
in detail in Chapter 2.
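
As a rough sketch of the calculation just described, the expected genotype numbers under random combination of alleles can be computed as follows. The function name `expected_genotype_numbers` is hypothetical, and the code assumes only the two-allele case treated in the text.

```python
def expected_genotype_numbers(p, n_organisms):
    """Expected numbers of II, Ii, and ii when alleles combine at random."""
    q = 1.0 - p
    return (p * p * n_organisms,      # II: p^2
            2 * p * q * n_organisms,  # Ii: 2pq
            q * q * n_organisms)      # ii: q^2


expected = expected_genotype_numbers(0.65, 400)
print(" ".join(f"{x:.1f}" for x in expected))  # 169.0 182.0 49.0
```

The three expected numbers always add up to the sample size, because $p^2 + 2pq + q^2 = (p + q)^2 = 1$.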

PROBLEM 1.3 Suppose that a random sample of 400 snapdragons
from a population includes 185 red, 150 pink, and 65 white. Estimate
the allele frequency p of I and q of i. Assuming random combinations
of alleles in the genotypes, what are the expected numbers of
the three genotypes? Do the observed data seem to fit the expectations?

ANSWER Among the total of 800 alleles, the observed number of I
alleles is 2 × 185 + 150 = 520 and that of i alleles is 150 + 2 × 65 = 280.
Therefore, $\hat{p}$ = 520/800 = 0.65 and $\hat{q}$ = 280/800 = 0.35. Note that the
estimated allele frequencies are the same as above, even though the
observed numbers of the genotypes are different. With random
combinations of alleles in the genotypes, the expected numbers are again
169 red, 182 pink, and 49 white. Compared with these expectations, the
observations appear to include too many homozygous genotypes and too few
heterozygous genotypes. (A statistical method for deciding whether the fit is
satisfactory or not is discussed in Chapter 2.)

Parameters and Estimates 

In the discussion of flower color in snapdragons, we made a subtle
distinction between the allele frequency of the I allele (designated p) and the
estimated allele frequency of the I allele (symbolized $\hat{p}$). The distinction is
necessary whenever an experimenter makes inferences about an entire
population from an examination of a random sample from the population.
Quantities used in describing entire populations are parameters. In the
snapdragon example, the parameter of interest is the allele frequency p of I in the
entire population. Because we only have access to a sample of 400 organisms
from the population, the true value of p is unknown. The best we can do is
make an estimate of p based on a sample, hoping that the sample is
representative of the population as a whole. The estimate obtained from the
sample is designated $\hat{p}$ to emphasize that it is an estimate rather than the true
value. In this book, whenever it is necessary to distinguish parameters from
their estimates, we use unembellished symbols for parameters (for example,
p for the unknown frequency of an allele in a specified population) and the
same symbol with a circumflex for the estimated value (in this example, $\hat{p}$).

The Standard Error of an Estimate 

The distinction between a parameter and an estimate is important because
different samples may yield different values of the estimate for the same
reason that different sibships may yield different segregation ratios, namely,
chance variation from one repeated trial to the next. The estimation of an
allele frequency can be treated as repeated trials by supposing that the
alleles are sampled at random, one by one, from a very large population. In the
snapdragon example, there are 800 alleles sampled. If the allele frequency of
I has the true value p = 0.65, then the repeated-trials interpretation implies
that all possible outcomes of 800 trials have probabilities given by successive
terms in the expansion of $(0.65\,I + 0.35\,i)^{800}$. This is not an expansion that one
would want to do by hand, but the binomial expression makes evident the
underlying random-sampling process that accounts for variation in the
estimate of p from one sample of 800 alleles to the next.
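
Although the expansion is impractical by hand, a computer handles it easily. The following sketch evaluates every term, that is, the probability of observing exactly k copies of I among n sampled alleles. The function name `binomial_term` is hypothetical; working with logarithms via `math.lgamma` is simply our choice for avoiding very large intermediate numbers.

```python
import math


def binomial_term(k, n, p):
    """Probability of exactly k copies of I among n randomly sampled alleles."""
    # log of the binomial coefficient n! / (k! (n - k)!)
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_coeff + k * math.log(p) + (n - k) * math.log(1.0 - p))


n, p = 800, 0.65
terms = [binomial_term(k, n, p) for k in range(n + 1)]

print("sum of all terms:", sum(terms))                  # close to 1
print("most probable count:", terms.index(max(terms)))  # 520, i.e. an estimate of 0.65
```

The most probable single outcome is 520 copies of I, but it accounts for only a small share of the total probability; neighboring counts are nearly as likely, which is the sampling variation discussed in the text.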

Unless p is quite close to 0 or quite close to 1, there is a convenient
approximation to the binomial expansion $(pI + qi)^n$, where n is the number of alleles
sampled. As n becomes large, the distribution of $\hat{p}$ approaches the familiar
bell-shaped curve called the normal distribution. The normal distribution
features prominently in the analysis of traits determined jointly by multiple
genetic and environmental factors and it is discussed in detail in that context
(Chapter 9). For present purposes, it is sufficient to note that the degree to
which the values of $\hat{p}$ are clustered around the overall average depends on a
quantity called the standard error:

$$s = \sqrt{\frac{pq}{n}} \qquad (1.3)$$

where q = 1 - p. If the sampling and estimation of p were repeated many
times using the same population, then the values of $\hat{p}$ would be expected to
be clustered symmetrically around p according to the standard error as follows:
• Approximately 68% of the estimates $\hat{p}$ lie within plus or minus one
standard error of p.

• Approximately 95% of the estimates $\hat{p}$ lie within two standard errors of p.

• Approximately 99.7% of the estimates $\hat{p}$ lie within three standard errors
of p.

To put the matter in another way, with repeated sampling, 32% of the esti- 
mates would be expected to differ from the true value by more than one stan- 
dard error, 5% by more than two standard errors, and only 0.3% by more 
than three standard errors. 
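
The standard error and the quoted percentages can be checked numerically. In this sketch the function names are ours, and the coverage values come from the error function of the normal distribution in Python's standard `math` module; it assumes, as the text does, that the normal approximation applies.

```python
import math


def standard_error(p, n):
    """Standard error of an allele frequency estimate from n sampled alleles."""
    return math.sqrt(p * (1.0 - p) / n)


def normal_coverage(k):
    """Fraction of a normal distribution within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))


print(f"{standard_error(0.65, 800):.3f}")  # 0.017
for k in (1, 2, 3):
    print(k, f"{normal_coverage(k):.4f}")  # 0.6827, 0.9545, 0.9973
```

The exact coverage for two standard errors is about 95.4%, which the text rounds to 95%.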

As an illustration of the variation among repeated estimates of p, Figure
1.6 shows the values of $\hat{p}$ obtained in 100 repetitions of the experiment of
sampling 800 alleles from a large population in which the true allele
frequency is p = 0.65. Each of the 100 samples was created by computer simulation
using a random-number generator that yielded a 1 with probability 0.65 and
a 0 with probability 0.35. For each sample of 800, therefore, the estimate $\hat{p}$
equals the number of 1s in the sample divided by 800. As is evident in
Figure 1.6, the distribution of $\hat{p}$ values is more or less bell-shaped but not
exactly so because it is based on only 100 samples rather than an infinite
number. The overall mean $\hat{p}$ from all 100 samples combined (80,000
observations) equals 0.6492, which is very close to the true value of p. Furthermore,
the distribution of the estimates fits the predictions based on the standard
error quite well.

[Figure 1.6: histogram. Horizontal axis: allele frequency estimate. Scale across the top: deviation from mean (SE), from -3 to +3.]

Figure 1.6 Estimates of allele frequency based on 100 samples, each of size
400 diploid organisms, from a population in which the actual allele
frequency is 0.65. The standard error equals 0.017, and the distribution of the
estimates is very close to the bell-shaped distribution expected theoretically.
The scale across the top gives the ranges of the estimates as multiples of the
standard error.

To apply Equation 1.3 to the data in Figure 1.6, note first that p = 0.65
with n = 800, and so s in Equation 1.3 equals $\sqrt{(0.65 \times 0.35)/800} = 0.017$.
Because 68% of the samples are expected to yield values of $\hat{p}$ in the range p
± s, and because the expected distribution is symmetrical, 34 of the values in
Figure 1.6 are expected in the range p - s to p (0.633-0.650) and 34 in the
range p to p + s (0.650-0.667); the actual numbers are 33 in the first interval
and 35 in the second. By the same reasoning, 95% of the values should lie in
the range p ± 2s, or 47.5% on each side of the mean; because 34% of the values
on either side of the mean are in the range p ± s, the implication is that 47.5 - 34,
or 13.5%, of the values should lie in the range p - 2s to p - s and 13.5% should
lie in the range p + s to p + 2s. For the data in Figure 1.6, these ranges are
0.616-0.633 and 0.667-0.684; the actual numbers in the two intervals are 18 and 10,
respectively, as against the theoretical 13.5 in each. Likewise, the standard
error predicts that 0.3% of the samples will deviate by more than 3s from the
mean, as compared with the observed 2.
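
A simulation of the kind summarized in Figure 1.6 can be sketched as follows. This is not the authors' program: the function names, the use of Python's `random` module, and the seed value are all our assumptions, so the individual counts will differ from those in the figure while showing the same general pattern.

```python
import math
import random


def simulate_estimates(p=0.65, n_alleles=800, n_samples=100, seed=1):
    """Return n_samples estimates of p, each from n_alleles random draws."""
    rng = random.Random(seed)  # arbitrary fixed seed so reruns are repeatable
    estimates = []
    for _ in range(n_samples):
        # each draw yields a 1 (allele I) with probability p, otherwise a 0
        ones = sum(1 for _ in range(n_alleles) if rng.random() < p)
        estimates.append(ones / n_alleles)
    return estimates


def fraction_within(estimates, p, n_alleles, k):
    """Fraction of the estimates lying within k standard errors of p."""
    s = math.sqrt(p * (1.0 - p) / n_alleles)
    return sum(1 for est in estimates if abs(est - p) <= k * s) / len(estimates)


estimates = simulate_estimates()
print("mean estimate:", sum(estimates) / len(estimates))
for k in (1, 2, 3):
    print(f"within {k} SE:", fraction_within(estimates, 0.65, 800, k))
```

Changing the seed gives a different set of 100 estimates, which is a direct way to see how much the tallies in each band fluctuate around the theoretical 68%, 95%, and 99.7%.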

Estimates and their standard errors are often presented as $\hat{p} \pm s$, or 0.65 ±
0.017 in the present example. The 68%, 95%, and 99.7% cutoffs for ± 1, ± 2,
and ± 3 standard errors provide one manner in which the reliability of an
estimate may be interpreted. Estimates may also be presented alternatively in
terms of a range called a confidence interval, which expresses a degree of
confidence that the true value of a parameter lies in some specified interval.
The most frequently encountered confidence interval is the 95% confidence
interval, defined as the interval ($\hat{p} - 2s$, $\hat{p} + 2s$). Because 95% of repeated
samples are expected to yield estimates in a range ± 2s around the true mean,
then 95% of the time the interval ($\hat{p} - 2s$, $\hat{p} + 2s$) is expected to include the
true value of the parameter p. In the snapdragon example with $\hat{p}$ = 0.65 and
s = 0.017, the 95% confidence interval is 0.616-0.684.
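
The confidence interval as defined here is simple to compute. The sketch below follows the text's convention of using whole multiples of the standard error, with the standard error evaluated at the estimate; the function name `confidence_interval` and its parameters are hypothetical.

```python
import math


def confidence_interval(p_hat, n_alleles, n_se=2):
    """Interval p_hat +/- n_se standard errors (n_se=2 gives roughly 95%)."""
    s = math.sqrt(p_hat * (1.0 - p_hat) / n_alleles)
    return p_hat - n_se * s, p_hat + n_se * s


low, high = confidence_interval(0.65, 800)
print(round(low, 3), round(high, 3))  # 0.616 0.684
```

Passing `n_se=1` or `n_se=3` gives the corresponding 68% and 99.7% intervals.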

PROBLEM 1.4 The MN blood groups in human beings are determined
by two alleles of a single gene, designated M and N. Each allele
results in the production of a different type of polysaccharide molecule
on the surface of red blood cells, which can be distinguished by
means of appropriate chemical reagents. The types of molecules
corresponding to the M and N alleles are designated M and N,
respectively. The M and N alleles are codominant; that is, genotype MM
produces only the M substance and has blood group M, genotype NN
produces only the N substance and has blood group N, and the
heterozygous genotype MN produces both the M and N substances and
has blood group MN. Among a sample of 1000 British people (Race
and Sanger 1975), the observed numbers of each blood group were
298 M, 489 MN, and 213 N. Using these data, estimate the allele
frequency p of the M allele and calculate its standard error. What are the
68%, 95%, and 99.7% confidence intervals for p?

ANSWER Because each genotype has a unique phenotype, the
sample contains 2 × 298 + 489 = 1085 M alleles, and so $\hat{p}$ = 1085/2000
= 0.5425. The standard error $s = \sqrt{(0.5425)(1 - 0.5425)/2000} = 0.0111$. The
68%, 95%, and 99.7% confidence intervals for p are $\hat{p}$ ± 1s, ± 2s,
and ± 3s, respectively, and so the confidence intervals are 0.5314-0.5536
(68%), 0.5202-0.5647 (95%), and 0.5092-0.5758 (99.7%).


Population geneticists must contend with factors such as population size,
patterns of mating, geographical distribution of organisms, mutation,
migration, and natural selection. Although we wish ultimately to
understand the combined effects of all these factors and more, the factors are so
numerous and interact in such complex ways that they cannot usually be
grasped all at once. Simpler situations are therefore devised, situations in
which a few identifiable factors are the most important ones and others can
be neglected. An intentional simplification of a complex situation is a
model. There are several types of models, each designed to eliminate
extraneous detail in order to focus attention on the essentials. Some models are
experimental. An experimental model may consist of a laboratory
experiment with population cages of Drosophila or growing cultures of bacteria.
An experimental model may also consist of observations of natural
populations in particular locations or at particular times in which evolutionary
forces of interest may be presumed to be present. Models of this type
include the study of the origin and spread of insecticide resistance in insects
or antibiotic resistance in bacteria.

A model may also be a conceptual simplification. Conceptual models
have a number of uses. They require a concise statement of presumed
mechanisms and interactions; they afford a framework for interpreting
observations and setting research priorities; they enable extrapolation into the future
or beyond the range of known parameters; and they suggest tests of
consistency between theory and observation.

A conceptual model may consist of verbal arguments logically linking a
chain of hypotheses and deductions. Another type of conceptual model is a
computer program that simulates the random component in a process or that
calculates the values of changing quantities in a complex system based on
prescribed numerical relations. An example of a computer model is the one
for examining the result of repeated random sampling whose outcome is
depicted in Figure 1.6. In population genetics, a kind of model frequently
encountered is a mathematical model, which is a set of hypotheses that
specifies the mathematical relations between measured or measurable quantities
(the parameters) in a system or process. Mathematical models can be
extremely useful:

• They express concisely the hypothesized quantitative relationships
between parameters.

• They reveal which parameters are the most important in a system and
thereby suggest critical experiments or observations.

• They serve as guides to the collection, organization, and interpretation of
observed data.

• They make quantitative predictions about the behavior of a system that
can, within limits, be confirmed or shown to be false.

The validity of any model must be tested by determining whether the
hypotheses on which it is based and the predictions that grow out of it are
consistent with observations.

A mathematical model is always simpler than the actual situation it is
designed to elucidate. A model is supposed to be simple. If it is not simpler
than the real situation, then it isn't a model. Models are simpler than real
situations because many features of real life are intentionally ignored. To
include every aspect of a complex system would make a model too complex
and unwieldy. Construction of a model always requires a compromise
between realism and manageability. A completely realistic model is likely to
be too complex to handle mathematically, and a model that is mathematically
simple may be so unrealistic as to be useless. Ideally, a model should
include all essential features of the system and exclude all nonessential ones.
How good or useful a model is often depends on how closely this ideal is
approximated. In short, a model is a sort of metaphor or analogy. Like all
analogies, it is valid only within certain limits but, when pushed beyond
these limits, becomes misleading or even absurd.

In this book, we are going to take many liberties with mathematical rigor.
Our excuse is that the basic ideas of a model are often obscured rather than
illuminated by excessive attention to mathematical detail. Our authority for
the approach is the great physicist Richard Feynman, who wrote in one of his 

Mathematicians may be completely repelled by the liberties taken here. The
liberties are taken not because the mathematical problems are considered
unimportant. On the contrary, [I hope] to encourage the study of these forms
from a mathematical standpoint. In the meantime, just as a poet has a license
from the rules of grammar and pronunciation, we should like to ask for
"physicists' license" from the rules of mathematics in order to express what
we wish to say in as simple a manner as possible.

Exponential Population Growth 

To illustrate the nature of mathematical models (as well as some of their
limitations) we consider the dynamics of population growth, a subject of
considerable interest in population genetics and population biology. In Figure
1.7, the solid dots show the increase in the number of cells of the yeast
Saccharomyces cerevisiae in a defined quantity of culture medium. The
number of cells increases slowly at first (0-4 hours), then more rapidly (hours
4-12), then more slowly again (hours 12-18). As a first approximation of the
early stages of population growth, we may assume that a constant fraction
of the cells reproduces in each interval of time. To simplify matters further,
we will assume that the population size does not change gradually but
changes in a discrete and instantaneous "jump" at the end of each hour. A
model of this type is a discrete model of population growth. Thus, we may
write

$$N_t = N_{t-1} + rN_{t-1} \qquad (1.4)$$

where $N_t$ and $N_{t-1}$ represent population size at the end of hours t and t - 1 and
where r is a constant called the intrinsic rate of increase equal to the fraction
of cells that reproduce in each interval of time. This equation says that the
population size at the end of hour t is the sum of two components: (1) all the
cells present at the end of hour t - 1 (which means that none of the cells die),
and (2) the progeny of the $rN_{t-1}$ cells that divided in the interval.

Figure 1.7 Increase in the number of cells of the yeast Saccharomyces
cerevisiae in a defined quantity of culture medium (dots). The smooth curves are
made from mathematical models of exponential growth or logistic growth.
(Data from Pearl 1927.)

Equation 1.4 illustrates a feature of theoretical population genetics that
sometimes leads to confusion: the same symbols are often used for different
things. In this equation, r is the intrinsic rate of increase in population number.
In other equations in population genetics, r is the recombination fraction
between two genes linked in the same chromosome. The symbol r is used for
still other parameters also. Any possible confusion could be avoided by
indicating each parameter with a different letter; this solution is impractical
because one quickly runs out of letters, even including Greek letters. Another
way is to distinguish different meanings of the same letter by typography, the
use of superscripts, subscripts, and so forth. The problem with this approach
is that even simple equations get to look imposing. Still another solution, which
is the one adopted in this book, is to ask the reader to pay close attention to the
context so that, for example, r as used in the context of population growth is
not confused with r used in the context of genetic linkage and recombination.
The solution to Equation 1.4 is straightforward. Because $N_t = (1 + r)N_{t-1}$, it
follows that $N_{t-1} = (1 + r)N_{t-2}$. Consequently, we can write
$N_t = (1 + r)(1 + r)N_{t-2} = (1 + r)^2 N_{t-2}$. However, $N_{t-2} = (1 + r)N_{t-3}$, and so
$N_t = (1 + r)^3 N_{t-3}$. Continuing in this manner, we eventually deduce that

$$N_t = (1 + r)^t N_0 \qquad (1.5)$$


For the data in Figure 1.7, if we set $N_0 = 10$ (the observed number) and
r = 0.7083, the first few points from Equation 1.5 (indicated by crosses) fit
very well: $N_0 = 10$, $N_1 = 17$, $N_2 = 29$, $N_3 = 50$. Then the model starts to break
down: $N_4 = 85$, $N_5 = 145$, $N_6 = 249$, and thereafter the fit becomes very bad
indeed. The lesson from this example is that many models have a range over
which they are reasonable approximations to the real world, in this case, for
a short time after a yeast culture is inoculated. If the model is extrapolated
beyond its range of validity, it yields nonsense. The problem for many
models in population genetics is that their range of validity is unknown.
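
The predictions of the discrete model can be generated by applying Equation 1.4 hour by hour. The following is a minimal sketch; the function name `discrete_growth` is ours, and the starting size and rate are the values quoted in the text.

```python
def discrete_growth(n0, r, hours):
    """Population sizes at the end of hours 0, 1, ..., hours under Equation 1.4."""
    sizes = [n0]
    for _ in range(hours):
        previous = sizes[-1]
        sizes.append(previous + r * previous)  # survivors plus progeny
    return sizes


sizes = discrete_growth(10, 0.7083, 6)
print([round(n) for n in sizes])  # [10, 17, 29, 50, 85, 145, 249]
```

Extending `hours` to 18 shows how quickly the model departs from anything a finite culture could sustain, which is the breakdown described above.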

In Equation 1.5, $N_t$ is defined only for t equal to positive integers because
of the discrete nature of the model. Population growth is actually a
continuous process. Population size increases gradually rather than in jumps. The
continuous-growth version of Equation 1.5, shown by the dashed line labeled
"exponential curve" in Figure 1.7, is given by

$$N(t) = N(0)e^{r_0 t} \qquad (1.6)$$


where $r_0 = \ln(1 + r)$. The rationale for Equation 1.6 is based on the same sort
of argument as Equation 1.4 but compressing the time scale. Whereas
Equation 1.4 assumes that each unit of time is one hour, suppose that each
time unit were, say, one minute. In slowing down the time scale in this
manner, we must also decrease the value of r; otherwise too many organisms
would reproduce in each unit of time. Therefore, by analogy with Equation
1.4, we can write $N_t - N_{t-1} = r_0 N_{t-1}$, but here $r_0$ is the intrinsic rate of increase
in the new time scale. If N(t) is a smooth, continuous function and not
changing too fast, then it is easy to convince yourself that $N_t - N_{t-1}$ should
approximate the derivative of N(t), which is the change in N(t) in a small
interval of time, and that $N_{t-1}$ should be close to N(t) because we have
assumed that N(t) is not changing very fast in the new time scale. Therefore,
we can write

$$\frac{dN(t)}{dt} = r_0 N(t)$$




Because $d \ln N(t)/dt = [dN(t)/dt]/N(t)$, where ln denotes the natural logarithm,
the solution of Equation 1.8 is $\ln N(t) = r_0 t + C$, where C is a constant chosen
so that N(t) = N(0) when t = 0. (Hence, C = ln N(0).) Expressing the solution in
terms of N(t) rather than ln N(t) yields Equation 1.6. Furthermore, comparing
Equation 1.6 with Equation 1.5, it is clear that

$$N(0)e^{r_0 t} = (1 + r)^t N_0$$

and therefore $r_0 = \ln(1 + r)$ is the relation between the parameter $r_0$ in the
continuous model and the parameter r in the discrete model. Equation 1.6 is the
exponential function plotted in Figure 1.7 with N(0) = 10 and $r_0 = 0.5355$.
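
The relation between the two rate parameters is easy to verify numerically. In this sketch the function names are hypothetical; it converts the discrete rate r = 0.7083 to its continuous counterpart and confirms that the two models agree at whole hours.

```python
import math


def continuous_rate(r):
    """Continuous-model rate r0 = ln(1 + r) matching a discrete rate r."""
    return math.log1p(r)


def exponential_size(n0, r0, t):
    """Population size at time t under continuous exponential growth."""
    return n0 * math.exp(r0 * t)


r0 = continuous_rate(0.7083)
print(round(r0, 4))  # 0.5355

# At integer t the continuous curve passes through the discrete-model values.
for t in range(4):
    print(t, round(exponential_size(10, r0, t), 2), round(10 * 1.7083 ** t, 2))
```

Between whole hours only the continuous model is defined, which is why it appears as a smooth curve in Figure 1.7.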

PROBLEM 1.5 Under optimal culture conditions, the bacterium
Escherichia coli can double in population size every 20 minutes.
Because population growth is continuous, Equation 1.7 is the
appropriate model. A single cell of E. coli is cylindrical in shape and has a
volume of approximately $1.6\ \mu\mathrm{m}^3$ ($1.6 \times 10^{-12}\ \mathrm{cm}^3$). A standard soccer
ball has a diameter of 22 cm (roughly 9 inches) and a volume of
approximately 5600 $\mathrm{cm}^3$.

(a) What intrinsic rate of increase $r_0$ per minute results in a
doubling time of 20 minutes?

(b) Starting with a single cell of E. coli growing under optimal
conditions, how long would it take to produce enough cells to fill
one soccer ball?

(c) How many soccer balls could be filled with cells after 24 hours
of unrestricted growth?

ANSWER (a) Set $N(20) = 2N(0) = N(0) \exp(r_0 \times 20)$, where exp(·)
stands for $e^{(\cdot)}$. Therefore, $r_0 = (\ln 2)/20 = 0.034657$. (b) One soccer ball
full of cells equals $5600/(1.6 \times 10^{-12}) = 3.5 \times 10^{15}$ cells. The time needed
to produce this many cells is given by $t = [\ln(3.5 \times 10^{15})]/r_0 = 1032.7$
minutes (17.2 hours). (c) After 24 hours (1440 minutes) of unrestricted
growth, one cell yields $\exp(r_0 \times 1440) = 4.7 \times 10^{21}$ cells, which would
fill about 1.35 million soccer balls. (Note: If your answers to this
problem are a little different from those given, it is probably because
the numbers given were calculated to nine significant digits before
rounding off.)
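
The arithmetic in this answer can be reproduced directly. The sketch below uses only the quantities stated in the problem; the variable names are ours.

```python
import math

DOUBLING_TIME_MIN = 20.0
CELL_VOLUME_CM3 = 1.6e-12
BALL_VOLUME_CM3 = 5600.0

# (a) rate per minute that doubles the population in 20 minutes
r0 = math.log(2.0) / DOUBLING_TIME_MIN

# (b) cells needed to fill one ball, and minutes needed to grow them from one cell
cells_per_ball = BALL_VOLUME_CM3 / CELL_VOLUME_CM3
minutes_to_fill = math.log(cells_per_ball) / r0

# (c) cells after 24 hours (1440 minutes), expressed in soccer balls
cells_after_24h = math.exp(r0 * 1440.0)
balls_after_24h = cells_after_24h / cells_per_ball

print(f"r0 = {r0:.6f} per minute")
print(f"time to fill one ball = {minutes_to_fill:.1f} minutes")
print(f"cells after 24 h = {cells_after_24h:.2e}")
print(f"soccer balls after 24 h = {balls_after_24h:.3e}")
```

Note that 1440 minutes is exactly 72 doublings, so the cell count after 24 hours is $2^{72}$.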

Logistic Population Growth 

The calculations in Problem 1.5 indicate that no real population can grow
exponentially for more than a relatively small number of generations
without catastrophic consequences. In nature, although factors such as disease
and predation often contribute to the control of population size, populations
that grow too large ultimately must deplete the available resources. The kind
of growth curve in Figure 1.7 is typical for populations expanding in a new
environment: the initial population growth is exponential, but then the rate
of growth gradually decreases.

A simple alternative to exponential growth is the logistic model; the term
logistic refers to proportions and, in the logistic model, the rate of population
growth is assumed to decrease in proportion to the population size. By
analogy with Equation 1.4, the change in population size with a discrete model of
population growth takes the form

$$N_t = N_{t-1} + rN_{t-1}\left(\frac{K - N_{t-1}}{K}\right) \qquad (1.10)$$

In this equation, K is a constant known as the carrying capacity of the
environment. Observe that, when N is very small compared with K, then
$N_t \approx N_{t-1} + rN_{t-1}$, and so population growth is nearly exponential. On the other hand,
when N is close to K, then $N_t \approx N_{t-1}$, and so population growth comes to a
standstill.
Unlike Equation 1.4, Equation 1.10 does not have a simple solution for $N_t$
in terms of $N_0$. However, if the population grows sufficiently slowly, then
population growth can be treated as continuous, and Equation 1.10 yields the
differential equation

$$\frac{dN(t)}{dt} = r_0 N(t)\left(\frac{K - N(t)}{K}\right) \qquad (1.11)$$

The solution of Equation 1.11 is given by

$$N(t) = \frac{K}{1 + Ce^{-r_0 t}} \qquad (1.12)$$

where the constant $C = (K - N_0)/N_0$. Equation 1.12 is called the logistic
growth curve and it is derived in Problem 1.7 below. Logistic population
growth results in a sort of S-shaped curve like that shown in Figure 1.7,
where the parameters are $r_0 = 0.5355$, $N_0 = 10$, and K = 665. (Note that the $r_0$ and
$N_0$ parameters are the same as in the exponential-growth model for the same
data.) The fit is obviously very good indeed.
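
To see how the logistic curve departs from the exponential curve, the two can be tabulated side by side. This is a sketch using the parameter values quoted in the text; the function names are ours, and the default arguments simply encode those values.

```python
import math


def logistic_size(t, n0=10.0, r0=0.5355, k=665.0):
    """Population size at time t on the logistic growth curve (Equation 1.12)."""
    c = (k - n0) / n0
    return k / (1.0 + c * math.exp(-r0 * t))


def exponential_size(t, n0=10.0, r0=0.5355):
    """Population size at time t under unrestricted exponential growth."""
    return n0 * math.exp(r0 * t)


for t in range(0, 19, 3):
    print(t, round(logistic_size(t), 1), round(exponential_size(t), 1))
```

The two curves start together at 10 cells, but the logistic values level off just below the carrying capacity of 665 while the exponential values keep multiplying.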

PROBLEM 1.6 Use Equation 1.12 with $N_0 = 10$, $r_0 = 0.5355$, and K =
665 to calculate N(t) for the times t = 7 and 8 and t = 13 and 14. What
are the values of r in Equation 1.10 for t = 8 and t = 14? Why are they
not equal to 0.5355? Why are they not equal to each other?

ANSWER With the given parameters, N(7) = 261.53, N(8) = 349.43,
N(13) = 626.13, and N(14) = 641.68. Solving Equation 1.10 for r and
substituting N(t) yields r = 0.5540 for t = 8 and r = 0.425 for t = 14.
Neither of these values agrees with $r_0 = 0.5355$, nor do they agree with
each other, because Equation 1.10 pertains to a discrete model and
Equation 1.12 to a continuous model. When the population grows
continuously, the value of r needed to produce a given change in
population size in some discrete interval of time depends on the
magnitude of the change in population size.
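
The values of r in this answer follow from rearranging Equation 1.10 as $r = (N_t - N_{t-1}) / [N_{t-1}(K - N_{t-1})/K]$. The sketch below applies that rearrangement to sizes taken from the continuous curve; the function names are hypothetical.

```python
import math

N0, R0, K = 10.0, 0.5355, 665.0


def logistic_size(t):
    """Population size at time t from the logistic growth curve (Equation 1.12)."""
    c = (K - N0) / N0
    return K / (1.0 + c * math.exp(-R0 * t))


def discrete_rate(t):
    """Value of r that Equation 1.10 needs to carry N(t - 1) to N(t)."""
    previous, current = logistic_size(t - 1), logistic_size(t)
    return (current - previous) / (previous * (K - previous) / K)


print(round(discrete_rate(8), 4))   # approximately 0.554
print(round(discrete_rate(14), 4))  # approximately 0.425
```

Evaluating `discrete_rate` over the whole range of t shows the required r drifting steadily as the population approaches K, which is another way of stating that no single discrete r reproduces the continuous curve.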


PROBLEM 1.7 Use the expression $\int dX/[X(a + bX)] = -(1/a) \ln[(a + bX)/X]$
to derive the logistic growth curve from Equation 1.11.

ANSWER Write Equation 1.11 as $dN(t)/\{N(t)[K - N(t)]\} = (r_0/K)\,dt$. In
comparing with the integral form, it is clear that a = K and
b = -1. Integrating both sides in accordance with the formula results in
$-(1/K) \ln\{[K - N(t)]/N(t)\} = r_0 t/K + const$, where const is a constant of
integration chosen so that N(t) = N(0) when t = 0. Hence,
$const = -(1/K) \ln\{[K - N(0)]/N(0)\} = -(1/K) \ln C$, where C is the constant appearing
in Equation 1.12. Consequently, $\ln\{[K - N(t)]/N(t)\} = -r_0 t + \ln C$, and so
$[K - N(t)]/N(t) = C \exp(-r_0 t)$. Equation 1.12 follows after some
simplification.


Population genetics is the application of Mendel's laws and other genetic
principles to entire populations of organisms. It includes the study of
genetic variation within and between species and attempts to understand the
processes resulting in adaptive evolutionary changes in species through
time. Population genetics has many practical applications in medicine,
agriculture, conservation, and other fields.

A gene is a hereditary determinant transmitted from parent to offspring
that influences a hereditary trait, often in combination with other genes and
also with the environment. Alleles are alternative forms of a gene. Genotypes
are formed from pairs of alleles and are either homozygous (if the alleles in
the genotype are the same) or heterozygous (if the alleles are different). The
physical or biochemical characteristics of an organism constitute its
phenotype. The essential mechanism of genetic transmission was established in
experiments by Gregor Mendel in the years 1856 to 1863. Mendel showed
that the alleles of each gene separate (segregate) from one another in the
formation of reproductive cells or gametes. Genes are arranged in linear order
along chromosomes. A chromosome may contain several thousand genes.
Alleles of different genes present in the same chromosome tend to be
inherited together (linkage), but the allele combinations can be broken up by
recombination.
Chemically, a gene is a region of a DNA molecule. DNA is a metaphorical
"twisted ladder" consisting of two paired strands composed of polymers of
nucleotides (the sidepieces of the ladder) whose bases (either A, T, G, or C)
jut inward from the sidepieces to form the rungs. Each rung of the ladder
consists of either an A-T base pair or a G-C base pair. Most genes code for
the polypeptide chains of proteins through a transcript of RNA that is
processed into the messenger RNA (mRNA). The polypeptide is produced
stepwise by translation of the mRNA according to a triplet genetic code, in
which each nonoverlapping group of three adjacent bases (a codon) specifies
the amino acid to be attached to the growing chain. Alleles differ in their
sequence of nucleotides. A nucleotide substitution in the third position of a
codon may not result in an amino acid replacement in the encoded polypeptide
because of redundancy in the genetic code. However, most nucleotide
substitutions in either of the first two positions do result in amino acid
replacements.
A probability is a number between and 1 that measures the likelihood of 
a particular event being realized in an actual or conceptual experiment The 
addition rule applies to mutually exclusive events and states that the proba- 
bility of one or the other event being realized equals the sum of the separate 
probabilities. The multiplication rule applies to independent events and 
states that the probability of both events being realized simultaneously 
equals the product of the separate probabilities. The probabilities of various
outcomes of repeated and independent trials can be deduced by application
of the addition and multiplication rules and conform to successive terms in
the binomial expansion (p + q)^n.
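As a minimal sketch of the binomial expansion just mentioned (the function name `binomial_terms` and the example values are illustrative assumptions, not taken from the text), the successive terms of (p + q)^n can be computed directly:

```python
from math import comb


def binomial_terms(n, p):
    """Return the successive terms of (p + q)^n, where q = 1 - p.

    Term k is C(n, k) * p**(n - k) * q**k: the probability of n - k
    outcomes of the first kind and k of the second kind in n
    independent trials.
    """
    q = 1.0 - p
    return [comb(n, k) * p ** (n - k) * q ** k for k in range(n + 1)]


# Illustrative example: four independent trials, each with probability 1/2.
terms = binomial_terms(4, 0.5)
for k, term in enumerate(terms):
    print(f"{4 - k} of the first kind, {k} of the second: {term:.4f}")
print("sum of terms:", sum(terms))
```

Each term combines the multiplication rule (independent trials) with the addition rule (mutually exclusive orderings), so the terms necessarily sum to 1.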

Natural populations contain genetic variation in the form of multiple
alleles of many genes. For any specified allele, the allele frequency is the
proportion of all alleles of the gene that are of the specified type. The
allele frequency in a population must usually be estimated from a sample, and
so there is variation in the estimate from one sample to the next. The
variation is quantified by the standard error. If the distribution of the
estimates conforms to a normal, bell-shaped distribution, then the proportions
of the estimates lying within ±1, ±2, and ±3 standard deviations of the true
value of the parameter are 68%, 95%, and 99.7%, respectively. Estimates are
also often presented as a confidence interval, which expresses the degree of
confidence that the true value of a parameter lies in some specified interval.

A model is a deliberate simplification of a complex situation. Models may
be experimental or conceptual. Conceptual models may be verbal,
computational, or mathematical. Mathematical models are widely used in
population genetics. They specify the mathematical relations between measured
or measurable quantities that determine the changes in allele frequency in
populations. Population growth affords an example of mathematical modeling.
In the simplest model of discrete population growth, at discrete times a
constant fraction of the population reproduces, and so the population jumps
instantaneously from one size to the next. A more realistic model envisages
continuous reproduction through time, in which case population growth is
exponential. The exponential model often fits population growth in newly
colonized environments when the population density is low. Population
growth is ultimately limited by nutrients, space, or other resources. When
population growth decreases in proportion to population size, the S-shaped
logistic curve of population growth results; this curve is determined by the
intrinsic rate of increase r and the carrying capacity of the environment K.
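The contrast between exponential and logistic growth can be sketched numerically. The code below is only an illustration: it uses the standard closed-form solutions N(t) = N0 e^(rt) and N(t) = K / (1 + ((K - N0)/N0) e^(-rt)), and the function names and parameter values are assumptions chosen for the example rather than anything given in this summary.

```python
import math


def exponential_size(n0, r, t):
    """Continuous exponential growth: N(t) = N0 * e^(r t)."""
    return n0 * math.exp(r * t)


def logistic_size(n0, r, k, t):
    """Logistic growth: N(t) = K / (1 + ((K - N0) / N0) * e^(-r t))."""
    return k / (1.0 + ((k - n0) / n0) * math.exp(-r * t))


# Hypothetical values, chosen only for illustration.
N0, R, K = 10.0, 0.5, 1000.0
for t in (0, 5, 10, 20, 40):
    print(t,
          round(exponential_size(N0, R, t), 1),
          round(logistic_size(N0, R, K, t), 1))
```

At low density the two curves are nearly indistinguishable; as the population approaches K, the logistic curve levels off while the exponential curve keeps climbing.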


1. If you were to catch a collection of Drosophila, grind each one
individually in a buffer solution, and measure the rate at which this crude
whole-fly homogenate catalyzed the reaction for glucose-6-phosphate
dehydrogenase, you would find that the activities would vary by more than
fourfold. Make a list of possible causes of this variation.

2. Given the complexity of causes of variation in Problem 1, how much
variation would you expect to see in the underlying genetic cause of a human
inborn error of metabolism such as phenylketonuria? This disorder is
caused by insufficient activity of phenylalanine hydroxylase.

3. There are 64 codons in the genetic code, and each codon can undergo
nine single-site mutations (each base can mutate to three other bases), for
a total of 576 mutations. How many of these result in no change in the
"meaning" of the encoded sequence?

4. Assuming that all nucleotides in all codons mutate with equal frequency
(i.e., that all 576 mutations in Problem 3 occur at the same rate), are
mutations from one amino acid to another all equally likely?

5. The correspondence between genotype and phenotype is one of the most
complex and difficult aspects of evolutionary genetics. Describe an example
of a gene whose mutations cause more than one distinctly different
phenotype that do not appear to be related.

6. A population cage of Drosophila melanogaster is started with 50 males and
50 females, all having the genotype (e sr)/(e+ sr+). This notation implies
that one chromosome has the e and sr mutations, and the other has the
wild type allele at both loci. These two loci show a frequency of
recombination in females of r = 0.37, and the males produce only
nonrecombinant gametes. Calculate the expected frequency of the gametes for
both males and females and the expected offspring genotype frequencies.

7. In some human cultures it is very important to have a son and a daughter,
and couples continue having offspring until they have one of each. If
an entire population followed this rule, what would happen to the sex
ratio in the population?

8. If two genes are on different chromosomes, the probability that a gamete
has a particular allele of each of the two genes is the product of the
probability of drawing each allele because the draws are independent of one
another (see the multiplication rule). If each gene is on a different
chromosome, what is the chance that genotype Aa Bb CC Dd produces two
consecutive gametes that are A B C D?

9. If individual X has an autosomal recessive disease and both parents are
unaffected, what is the chance that the sibling of X is a heterozygous
carrier?

10. A line of mice seems to consistently produce 55% male and 45% female
offspring. In order to test whether this deviation is significant, how many
offspring would you have to count to be able to reject a 50 : 50 sex ratio at
a probability of α = 0.05? (Assume that the sex ratio of the mice remains
55 : 45.)

11. A species of butterflies occurs in two distinct morphs, A and B. You
sample two areas and count 26 A and 28 B butterflies in one area, and 10 A
and 21 B in another area. Is it possible that these two samples could come
from a single homogeneous population, or are the frequencies of the two
morphs significantly different from one another?

12. Levy and Levin (1975) used electrophoresis to study the phosphoglucose
isomerase-2 gene in the evening primrose Oenothera biennis, a complex
genomic heterozygote made true breeding by chromosomal translocations.
They observed two alleles affecting electrophoretic mobility of the
enzyme, and among 57 strains they found 35 PGI-2a/PGI-2a,
19 PGI-2a/PGI-2b, and 3 PGI-2b/PGI-2b genotypes.

a. Calculate the allele frequencies of PGI-2a and PGI-2b.

b. With random mating, what would be the expected numbers?

13. The simple models of population growth fail to take into account many
factors that affect rates of change. The global human population at 0 A.D.,
200 A.D., and at intervals of 200 years up to the present has been estimated
in millions of people as 200, 200, 200, 200, 250, 280, 350, 400, 550, 980,
and 6000. If the population were growing exponentially, these points
would fall on a straight line when plotted on a logarithmic scale. Draw
this plot. What do you conclude?

14. A healthy pair of Drosophila can produce 500 offspring in 12 days, each
adult fly weighing about 1 mg. Assume that the parental flies die after
they finish reproducing. (Actually, they live about a month.) If all
successive generations get enough to eat and remain this fecund, what will
the mass of flies be in one year?


Genetic and Phenotypic Variation 

Phenotypic Variation • Normal Distribution • Mendelian Variation

Protein Polymorphisms • DNA Polymorphisms • Multiple-Factor Model 

Genetic variation in populations became a subject of scientific
inquiry in the late nineteenth century prior even to the rediscovery
of Mendel's paper in 1900. The leading exponent of the study
of hereditary differences among human beings was Francis Galton
(1822-1911). Galton was a pioneer in the application of statistics to biology.
He used statistical methods to study physical traits such as eye color and
fingerprint ridges as well as behavioral traits such as temperament and
musical ability. Galton was among the first to examine the statistical
relations between the distributions of phenotypic traits in successive
generations. He is regarded as the founder of biometry, the application of
statistics to biological problems.


Galton and Mendel exemplify opposite approaches to the study of inherited
traits. Mendel's point of departure in the study of genetics was discrete
variation, in which phenotypic differences among organisms can be assigned to
a small number of clearly distinct classes, such as round versus wrinkled
peas. Galton's point of departure was continuous variation, in which the
phenotypes of organisms are measured on a quantitative scale, like height or
weight, and in which the phenotypes grade imperceptibly from one category
into the next. As material for the study of phenotypic variation, Galton's
choice was good: most of the differences among normal people that are
visible to the unaided eye are differences in continuous traits such as
height, weight, skin color, hair color, facial features, running speed, shoe
size, and so forth. The same is true of phenotypic variation in other
organisms. On the other hand, as material for the study of genetic variation,
Mendel's choice was good: the pattern of segregation of alleles is revealed
most clearly in pedigrees of discrete, simple Mendelian traits.

Continuous Variation: The Normal Distribution 

With continuous traits, not only do the phenotypes grade into one another,
but the traits also usually present difficulties for genetic analysis. The
problems are of two principal types:

• Most continuous traits are influenced by the alleles of two or more genes,
hence the segregation of any one gene in pedigrees is obscured by the
segregation of other genes that affect the trait.

• Most continuous traits are influenced by environmental factors as well as
by genes, and so genetic segregation is obscured by environmental
effects.

These problems are not insurmountable in organisms with a sufficiently 
high density of genetic markers scattered throughout the genome (the com- 
plement of chromosomes) because the genetic markers can be tracked in 
pedigrees along with the continuous trait of interest. Organisms with suffi- 
ciently dense genetic maps include human beings, laboratory animals, and 
many domesticated animals and crop plants. 

In Galton's time, however, studies of continuous traits based on genetic
linkage were unknown. Why, then, did Galton focus on continuous traits?
Because they have a sort of regularity, a statistical predictability, of
their own. For many continuous traits, when the phenotypes are grouped into
suitable intervals and plotted as a bar graph, the distribution of phenotypes
conforms closely to the normal distribution, the symmetrical, bell-shaped
curve discussed briefly in Chapter 1 in the section on phenotypic diversity
and genetic variation. For example, a bar graph of Galton's data on the
heights of 1329 men, rounded to the nearest inch, is plotted in Figure 2.1.
The smooth curve is the normal distribution that best fits the data. The
equation of the normal curve is:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2 / 2\sigma^2} \qquad (2.1)$$

where x ranges from −∞ to +∞, and π = 3.14159 and e = 2.71828 are constants.
The location of the peak of the distribution along the x axis is determined by
the parameter μ, which is the mean, or average, of the phenotypic values.
The degree to which the phenotypes are clustered around the mean is
determined by the parameter σ², which is the variance of the distribution.
Mathematically, the variance is the average of the squared difference of each
phenotypic value from the mean; that is, it is the average of the values of
(x − μ)². How μ and σ² are estimated from data is considered next.

Figure 2.1 Distribution of height among 1329 British men. (Data from Galton
1889.) [Bar graph with fitted normal curve; N = 1329, x̄ = 69, s = 2.5;
x axis: height (rounded to the nearest inch), from <63 to >75.]

Mean and Variance 

Because μ and σ² are parameters, their values are unknown, and they must
be estimated from the data themselves. The height data are tabulated in
Table 2.1, in which f_i is the number of men whose height is x_i, rounded to
the nearest inch. (The fact that the shortest and tallest men are grouped in
the tails of the distribution makes no difference because these men account
for only a small proportion of the total sample.) Also tabulated are the
products f_i × x_i and f_i × x_i² as well as their sums.

The mean μ of the distribution is estimated as the mean of the sample,
which is conventionally denoted x̄ (also sometimes as μ̂).

$$\bar{x} = \frac{\sum_i f_i x_i}{\sum_i f_i} \qquad (2.2)$$

In this example, x̄ = 91,639/1329 = 68.95 inches.

Likewise, the variance σ² of the distribution is estimated as the variance of
the sample, which is conventionally denoted s² (also sometimes as σ̂²).

$$s^2 = \frac{\sum_i f_i (x_i - \bar{x})^2}{\sum_i f_i} = \frac{\sum_i f_i x_i^2}{\sum_i f_i} - \bar{x}^2 \qquad (2.3)$$

The expression in the middle follows directly from the definition of the
variance: it is the average of the squared deviations from the mean because,
for each value of x_i, (x_i − x̄) is the deviation of that value from the
mean. The expression on the right is identical arithmetically but easier to
apply in practice. In the example in Table 2.1, s² = 6,326,939/1329 −
(68.95)² = 6.11. (This value may differ slightly from your own calculation
according to the number of significant digits you carried along before
rounding off.) If the sample size is small (say, less than 50), then a
slightly better estimate of the variance is obtained by multiplying the
expression in Equation 2.3 by n/(n − 1), where n is the total size of the
sample (in this case, 1329).
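The claim that the two forms of the sample variance are arithmetically identical can be checked in a few lines. Writing n for the total sample size (the sum of the f_i) and using the definition of the sample mean as the sum of f_i x_i divided by n, a short derivation runs as follows:

```latex
\begin{aligned}
\frac{1}{n}\sum_i f_i (x_i - \bar{x})^2
  &= \frac{1}{n}\sum_i f_i x_i^2
     - 2\bar{x}\,\frac{1}{n}\sum_i f_i x_i
     + \bar{x}^2\,\frac{1}{n}\sum_i f_i \\
  &= \frac{1}{n}\sum_i f_i x_i^2 - 2\bar{x}^2 + \bar{x}^2 \\
  &= \frac{1}{n}\sum_i f_i x_i^2 - \bar{x}^2 .
\end{aligned}
```

The second form needs only the two running sums of f_i x_i and f_i x_i², which is why it is the easier one to apply to a tabulated sample.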

Closely related to the variance is the standard deviation of the
distribution, which is the square root of the variance. The standard
deviation is a natural quantity to consider in view of the units of
measurement. In Table 2.1, for example, each measurement is in inches. The
mean is also in inches. However, the variance, being the average of squared
deviations, has the units of squared inches, which seems more appropriate for
an area than for a height. Taking the square root of the variance restores
the correct unit of measure: in this example, inches. The estimate of the
standard deviation is conventionally denoted s (also sometimes as σ̂) and it
is calculated as the square root of the quantity in Equation 2.3. In the
height example, s = 2.47 (which may again differ slightly from your own
calculation because of round-off error). The estimate s of the standard
deviation is often called the standard error. When estimating a proportion,
such as the frequency of an allele in a population, the standard error is
calculated according to Equation 1.3 in Chapter 1.

TABLE 2.1
Columns: interval (i); range (in.); inch (x_i); number of men (f_i);
f_i × x_i; f_i × x_i²
Ranges: 63.5–64.5, 64.5–65.5, 65.5–66.5, 66.5–67.5, 67.5–68.5, 68.5–69.5,
69.5–70.5, 70.5–71.5, 71.5–72.5, 72.5–73.5, 73.5–74.5, >74.5
Sums (as used in the text): Σf_i = 1329, Σf_i x_i = 91,639,
Σf_i x_i² = 6,326,939
Source: Data from Galton 1889.

In Chapter 1, the values 68%, 95%, and 99.7% quoted as the proportions of
observations expected to fall within 1, 2, or 3 standard errors of the mean,
respectively, emerge directly from Equation 2.1 for the normal distribution.
In a normal distribution, the exact proportion of observations falling within
any specified range of x equals the integral of Equation 2.1 across the
specified range. For the normal distribution, the integral between the limits
μ ± σ equals 0.6827, that between μ ± 2σ equals 0.9545, and that between
μ ± 3σ equals 0.9973. In data analysis, x̄ and s are used in place of μ and σ.
Incidentally, the integral of the normal distribution between the limits
μ ± 4σ equals 0.9999; this result says that fewer than one in 10,000
observations falls more than four standard deviations from the mean.
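These integrals are easy to reproduce numerically. The sketch below relies on the standard identity that, for a normal distribution, the probability of falling within k standard deviations of the mean equals erf(k/√2); the function name `within_k_sd` is an illustrative choice, not notation from the text.

```python
import math


def within_k_sd(k):
    """Probability that a normal variable lies within k standard
    deviations of its mean, computed as erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3, 4):
    print(k, round(within_k_sd(k), 4))
```

Because the result does not depend on the particular mean or standard deviation, the same four proportions apply to any normally distributed trait.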

Central Limit Theorem 

Galton was immensely impressed with the observation that many natural 
phenomena follow the normal distribution. He writes: 

I know of scarcely anything so apt to impress the imagination as the
wonderful form of cosmic order expressed by the "law of frequency of error"
[the normal distribution]. Whenever a large sample of chaotic elements is
taken in hand and marshaled in the order of their magnitude, this unexpected
and most beautiful form of regularity proves to have been latent all along.
The law would have been personified by the Greeks if they had known of it. It
reigns with serenity and complete self-effacement amidst the wildest
confusion. The larger the mob and the greater the apparent anarchy, the more
perfect is its sway. It is the supreme law of unreason.

It is, indeed, remarkable to consider that pure, blind chance is the reason for 
this "unexpected and most beautiful form of regularity." 

The theoretical basis of the normal distribution is known in probability
theory as the central limit theorem. Roughly speaking, the central limit
theorem states that the sum of a large number of independent random
quantities always converges to the normal distribution. For our purposes,
"independent" in this context means that information about any one of the
observations gives no improvement in the ability to predict any other of the
observations. A large number of independent random quantities is apparently
what Galton meant by "a large sample of chaotic elements." The central
limit theorem explains in part why so many continuously distributed traits
conform to the normal distribution. Most continuous traits are multifactorial,
meaning that they are influenced by "many factors," typically several or
many genes acting together with environmental factors. Among human
beings, for example, the obvious differences between normal people in hair
color, eye color, skin color, stature, weight, and other such traits are not
usually traceable to single genes. They result from the combined effects of
several or many genes as well as numerous environmental effects acting
together as "a large sample of chaotic elements," which often produce, in the
aggregate, a normal distribution of phenotypes.

It should be emphasized that the "large number" of random elements
specified in the central limit theorem need not be excessive. As an example,
Figure 2.2 is a bar graph of 100 observations in which each "observation"
consists of the sum of nine consecutive random numbers chosen with equal
probability from anywhere in the range (−1, +1). For the sum of nine random
numbers in this range, the theoretical mean equals 0 and the theoretical
standard deviation equals 1.73; the sample values were x̄ = −0.12 and
s = 1.70. Expressed as a deviation from the mean in multiples of the standard
error, the number of observations in each category is shown at the top of the
bar in Figure 2.2. Because the expected numbers are 2.5, 13.5, 68, 13.5, and
2.5, the fit to a normal distribution is obviously very good. In this
example, therefore, fewer than 10 "chaotic elements," when added together,
yield "this unexpected and most beautiful form of regularity."


Figure 2.2 Distribution of 100 values of the sum of nine random numbers
from the interval (−1, +1). [Bar graph; x axis: deviation from mean (±SE),
in categories from −2 to +2.]
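The experiment behind Figure 2.2 is simple to repeat. The following is a minimal sketch, not the authors' procedure: the function names, the fixed seed, and the choice to bin deviations around the theoretical mean of 0 in units of the theoretical standard deviation are all assumptions made for illustration. The theoretical standard deviation follows from the variance of a single uniform number on (−1, +1), which is 1/3, so the sum of nine has variance 3.

```python
import math
import random
import statistics


def sums_of_uniforms(n_sums, n_terms, seed=0):
    """Draw n_sums observations, each the sum of n_terms random numbers
    chosen uniformly from the interval (-1, +1)."""
    rng = random.Random(seed)
    return [sum(rng.uniform(-1.0, 1.0) for _ in range(n_terms))
            for _ in range(n_sums)]


def theoretical_sd(n_terms):
    """Standard deviation of the sum: each term has variance 1/3."""
    return math.sqrt(n_terms / 3.0)


def count_by_sd(values, sd):
    """Count values below -2, from -2 to -1, from -1 to +1, from +1 to +2,
    and above +2, measured in units of sd around a mean of 0."""
    bins = [0, 0, 0, 0, 0]
    for v in values:
        z = v / sd
        if z < -2:
            bins[0] += 1
        elif z < -1:
            bins[1] += 1
        elif z <= 1:
            bins[2] += 1
        elif z <= 2:
            bins[3] += 1
        else:
            bins[4] += 1
    return bins


observations = sums_of_uniforms(100, 9)
print("theoretical sd:", round(theoretical_sd(9), 2))
print("sample mean:", round(statistics.mean(observations), 2))
print("sample sd:", round(statistics.pstdev(observations), 2))
print("counts by category:", count_by_sd(observations, theoretical_sd(9)))
```

The counts will vary with the seed, but they should usually lie close to the expected 2.5, 13.5, 68, 13.5, and 2.5 for a sample of 100.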

PROBLEM 2.1 At an International Health Exhibition in London in
1884, Galton set up an "anthropometric laboratory" that carried out
tens of thousands of measurements covering a wide range of human
traits. Among the traits was "strength of pull," expressed as the number
of pounds that a person could pull with one arm against a resisting
force in a sort of arm-wrestling contraption (Galton 1889). The
data for 519 males aged 23-26 years fell into the following categories
(the number in parentheses is the number of males in each category):
40-50 lbs (10), 50-60 (42), 60-70 (140), 70-80 (168), 80-90 (113), 90-100
(22), 100-110 (24). Using the midpoint of each category as the strength
of pull for all males in that category, estimate the mean and standard
deviation of strength of pull. Assuming that strength of pull has a
normal distribution with parameters equal to these estimates, what is
the expected proportion of males whose strength of pull exceeds 112 lbs?

ANSWER The values of x_i are 45, 55, 65, and so forth. Then Σf_i = 519,
Σf_i x_i = 38,675, and Σf_i x_i² = 2,963,375. Hence, x̄ = 74.5 lbs,
s² = 156.8 lbs², and so s = 12.5 lbs. (Answers may differ slightly because of
round-off error.) A strength of pull of 112 lbs is three standard errors
above the mean; hence a proportion of only (1 − 0.997)/2 = 0.0015 (about one
in 667) males is expected to have a phenotype exceeding this value.
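The arithmetic in Problem 2.1 can be reproduced with a short script. This is a sketch under the problem's own assumptions (category midpoints stand in for every male in the category, and strength of pull is taken to be normally distributed); the helper names `grouped_mean_variance` and `upper_tail` are illustrative.

```python
import math


def grouped_mean_variance(midpoints, counts):
    """Mean and variance of grouped data: the mean is sum(f*x)/n and the
    variance is sum(f*x^2)/n minus the squared mean."""
    n = sum(counts)
    sum_fx = sum(f * x for x, f in zip(midpoints, counts))
    sum_fx2 = sum(f * x * x for x, f in zip(midpoints, counts))
    mean = sum_fx / n
    variance = sum_fx2 / n - mean ** 2
    return mean, variance


def upper_tail(x, mean, sd):
    """Probability that a normal variable with this mean and sd exceeds x."""
    z = (x - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Strength-of-pull categories (lbs) and counts from Problem 2.1.
midpoints = [45, 55, 65, 75, 85, 95, 105]
counts = [10, 42, 140, 168, 113, 22, 24]

mean, variance = grouped_mean_variance(midpoints, counts)
sd = math.sqrt(variance)
print(round(mean, 1), round(variance, 1), round(sd, 1))
print(upper_tail(112, mean, sd))
```

Integrating the normal curve directly gives a tail proportion slightly below the 0.0015 obtained from the rounded 99.7% rule, which is the kind of small discrepancy the answer attributes to rounding.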

Discrete Mendelian Variation 

Discrete Mendelian variation (also called simple Mendelian variation) refers
to phenotypic differences resulting from segregation of the alleles of a
single gene. Environmental effects on the trait are small enough, relative to
hereditary differences, that the transmission of alleles determining the
trait can be traced through pedigrees. An example of discrete Mendelian
variation is the inheritance of red, pink, or white flower color in
snapdragons (Chapter 1). This case is exceptionally convenient for genetic
studies because of the intermediate phenotype of the heterozygote. However,
most of the phenotypic variation in natural populations is multifactorial. In
human beings, for example, although simple Mendelian variation accounts for
many inherited disorders, each of the disorders is relatively rare.

Ironically, simple Mendelian variation is more easily detected by studying
genes and their products than by studying phenotypes. Because the
mechanisms of transcription, RNA processing, and translation are relatively
free of gene interactions and environmental effects that complicate the
analysis of multifactorial traits at the phenotypic level, there is a direct
connection between DNA sequences and alleles and a nearly direct connection
between genes and their products. Indeed, the correspondence between DNA
sequences and alleles is one-to-one: different alleles have different DNA
sequences irrespective of whether the alleles affect phenotype. Likewise,
alleles with nonsynonymous codon differences in a protein-coding region
result in different amino acid sequences irrespective of what the polypeptide
does in metabolism or how the difference in sequence affects the organism.

Hence, an efficient way to detect simple Mendelian variation is to study
molecules, and therein lies a paradox. As evolutionary biologists,
population geneticists are interested in observable phenotypes that are
likely to be subject to natural selection: morphology, rate of development,
mating behavior, age of reproduction, longevity, and so forth (in short, the
types of traits that attracted Galton). On the other hand, genetic studies
are most readily carried out with simple Mendelian variation detected as
differences between molecules. The paradox is that differences in molecules
among healthy organisms are not usually related in any obvious way to
differences in phenotype. Thus, there is a gap in being unable to specify
exactly which types of molecular differences underlie the evolutionary
process. The irony of the situation is similar to that described by the
physiologist Albert Szent-Györgyi:

My own scientific life was a descent from higher to lower dimensions, led by
the desire to understand life. I went from animals to cells, from cells to
bacteria, from bacteria to molecules, from molecules to electrons. The story
had its irony, for molecules and electrons have no life at all. On my way,
life ran out between my fingers.

The gap between genotype and phenotype results from the complex
interactions between genes and environment in the determination of
physiology, development, and behavior. In evolutionary biology, the
complexity is even greater because the key issue is the relative ability of
organisms to survive and reproduce in their environments. Nevertheless, the
disconnect between differences in molecules and evolutionary adaptations is
by no means inevitable, permanent, or insurmountable. It is already clear
that the study of the relation between genetic variation and evolutionary
adaptation must be high on the agenda of evolutionary biology for the next
century, and already there are many examples in which the relation is quite
well established.


For nearly 50 years, the workhorse method for revealing genetic variation
has been electrophoresis because small differences in rate of migration in an
electrophoretic field can be used to distinguish between nearly identical
macromolecules. A typical laboratory setup for electrophoresis is illustrated
schematically in Figure 2.3. The tray contains a thick layer of a gel,
typically starch, acrylamide, or agarose; it may be placed horizontally (as
shown in the illustration) or vertically (with the gel sandwiched between two
glass plates). Each sample of material is placed in a small slot near the
edge of the gel. Connected to each edge of the gel is a chamber containing a
buffered solution and electrodes. In electrophoresis, an electric current is
applied across the gel for several hours. Molecules in the samples (usually
proteins or nucleic acids are of greatest interest) move through the gel in
response to the electric field. Molecules of different size and charge move
at different rates. After the electrophoresis is finished, the positions of
the molecule or molecules of interest are revealed by any of several
procedures.

Figure 2.3 One type of laboratory apparatus for electrophoresis. The
procedure is widely used to separate protein or DNA molecules. In
conventional gels, DNA fragments smaller than about 20 kb migrate
approximately in proportion to the logarithm of their molecular weights.
[Diagram labels: bands (visible after suitable treatment); power supply.]

Protein Electrophoresis 

In protein electrophoresis, used primarily to study enzyme molecules, the
position to which a particular enzyme migrates is revealed by soaking the
gel in a solution containing a substrate for the enzyme along with a dye that
precipitates where the enzyme-catalyzed reaction takes place. A dark band
thus appears in the gel at the position of the enzyme. If the enzyme present
in a sample has an amino acid replacement that results in a difference in the
overall ionic charge of the molecule, then the enzyme will have a somewhat
altered electrophoretic mobility and move at a different rate. The
electrophoretic mobility changes because enzymes of the same size and shape
move at a rate determined largely by the ratio of the number of positively
charged amino acids (primarily lysine, arginine, and histidine) to the
number of negatively charged ones (principally aspartic acid and glutamic
acid). Electrophoresis can therefore be used to detect a mutation that
results in a difference in electrophoretic mobility of the enzyme it encodes.

One possible result of an electrophoresis experiment is shown in the
hypothetical gel in Figure 2.4A, in which all samples manifest an enzyme
with the same electrophoretic mobility. The result indicates a monomorphic
sample because there is only one electrophoretic pattern observed. Another
kind of result is shown in Figure 2.4B, in which polymorphism is observed
in the types of electrophoretic patterns. When polymorphic enzyme bands
are observed, genetic tests typically indicate that organisms with only a
fast-migrating enzyme are homozygous for a fast allele (F/F) and those
with only a slow-migrating enzyme are homozygous for a slow allele (S/S).
Organisms with both enzyme bands are heterozygous for the alleles (F/S).
Simple Mendelian inheritance of the polymorphism is indicated by, for
example, the finding that matings of two heterozygotes produce, on the
average, 1/4 F/F, 1/2 F/S, and 1/4 S/S progeny. Two enzyme bands appear in
heterozygotes whenever the active enzyme consists of a single polypeptide
chain (rather than two or more polypeptide chains aggregated together)
because heterozygotes produce a different polypeptide chain from each
allele.

Enzymes that differ in electrophoretic mobility as a result of allelic
differences in a single gene are called allozymes. Hence, allozyme variation
in a population is an indication of simple Mendelian genetic variation.
Allozyme variation is widespread in almost all natural populations studied by
electrophoresis, including organisms such as bacteria, plants, Drosophila,
mice, and human beings.

Figure 2.4 Monomorphism and polymorphism. (A) Hypothetical gel showing
protein monomorphism. All samples have an enzyme with the same
electrophoretic mobility. (B) Hypothetical gel showing allozyme polymorphism.
Eight samples are homozygous for an allele (F) that codes for a rapidly
migrating enzyme; two samples are homozygous for a different allele (S) that
codes for a slowly migrating enzyme, and six samples are heterozygous (F/S)
and therefore exhibit enzyme bands corresponding to both alleles.

PROBLEM 2.2 A sample of 35 organisms from a Texas population of
the wild annual plant Phlox drummondii were examined for the
electrophoretic mobility of the enzyme alcohol dehydrogenase (Levin
1978). Two alleles affecting electrophoretic mobility were found:
Adh^a and Adh^b. The genotype frequencies observed in the sample
were 0.04 Adh^a/Adh^a, 0.32 Adh^a/Adh^b, and 0.64 Adh^b/Adh^b. Estimate
the allele frequency of Adh^a and its standard error.

ANSWER Let p represent the allele frequency of Adh^a. Then p = 0.04
+ 0.32/2 = 0.20. The standard error equals √[(0.20)(1 − 0.20)/(2 × 35)] ≈
0.05.
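The calculation in Problem 2.2 can be written out as a short sketch. It assumes, as the answer does, that the standard error of an allele frequency p estimated from n diploid organisms is the square root of p(1 − p)/(2n); the function names are illustrative.

```python
import math


def allele_frequency(freq_homozygote, freq_heterozygote):
    """Allele frequency from genotype frequencies: all alleles in the
    homozygotes plus half of the alleles in the heterozygotes."""
    return freq_homozygote + freq_heterozygote / 2.0


def standard_error(p, n_individuals):
    """Standard error of p when 2 * n_individuals alleles are sampled
    from diploid organisms."""
    return math.sqrt(p * (1.0 - p) / (2 * n_individuals))


# Genotype frequencies and sample size from Problem 2.2.
p = allele_frequency(0.04, 0.32)
se = standard_error(p, 35)
print(round(p, 2), round(se, 3))
```

The divisor is 2 × 35 = 70 rather than 35 because each diploid plant contributes two alleles to the sample.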



PROBLEM 2.3 From a natural population of Drosophila melanogaster
in Raleigh, North Carolina, 660 fertilized females were trapped and
used to found a large laboratory population (Mukai et al. 1974). After
about five months (10 generations), 489 third chromosomes in the
population were examined for allozymes coding for the enzymes
esterase-6 (alleles E6^F and E6^S), esterase-C (alleles EC^F and EC^S), and
octanol dehydrogenase (alleles Odh^F and Odh^S). The order of the
genes in the third chromosome is known to be E6-EC-Odh. The
results were as follows:

E6^F EC^F Odh^F  152      E6^S EC^F Odh^F  264
E6^F EC^F Odh^S    7      E6^S EC^F Odh^S   13
E6^F EC^S Odh^F   15      E6^S EC^S Odh^F   29
E6^F EC^S Odh^S    1      E6^S EC^S Odh^S    8

Estimate the allele frequencies and their standard errors for E6^F and
E6^S, for EC^F and EC^S, and for Odh^F and Odh^S. What number of each of
the chromosome types is expected assuming that the alleles are
associated at random?

ANSWER For esterase-6, there were 175 E6^F and 314 E6^S alleles,
yielding p = 175/489 = 0.358 for E6^F and q = 314/489 = 0.642 for
E6^S; the standard error is the same for both estimates, and equals
√[(0.358)(0.642)/489] = 0.022. For the other alleles, the estimates and
their standard errors are 0.892 ± 0.014 for EC^F and 0.108 ± 0.014 for
EC^S; and 0.941 ± 0.011 for Odh^F and 0.059 ± 0.011 for Odh^S. With
random combinations, the expected number of each chromosome
type equals the product of the allele frequencies times 489. For
example, for E6^F EC^F Odh^F, the expected number is 0.358 × 0.892 × 0.941 ×
489 = 146.8. The expected numbers (observed in parentheses) of the
eight chromosome types are: 146.8 (152), 263.4 (264), 9.3 (7), 16.6 (13),
17.8 (15), 32.0 (29), 1.1 (1), 2.0 (8). The model of random association
of alleles fits very well.
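The expected numbers under random association can be recomputed from the eight observed counts. The sketch below keys each chromosome type by its (E6, EC, Odh) alleles; the data structure and function names are illustrative assumptions, while the counts are those given in Problem 2.3.

```python
import math
from itertools import product

# Observed counts keyed by (E6 allele, EC allele, Odh allele).
observed = {
    ("F", "F", "F"): 152, ("S", "F", "F"): 264,
    ("F", "F", "S"): 7,   ("S", "F", "S"): 13,
    ("F", "S", "F"): 15,  ("S", "S", "F"): 29,
    ("F", "S", "S"): 1,   ("S", "S", "S"): 8,
}
total = sum(observed.values())


def allele_freq(locus, allele):
    """Frequency of an allele at one locus (0 = E6, 1 = EC, 2 = Odh)."""
    carrying = sum(n for hap, n in observed.items() if hap[locus] == allele)
    return carrying / total


def std_error(p):
    """Standard error of an allele frequency among the sampled chromosomes."""
    return math.sqrt(p * (1.0 - p) / total)


def expected_count(hap):
    """Expected number of a chromosome type if alleles at the three loci
    are associated at random: product of allele frequencies times total."""
    expected = total
    for locus, allele in enumerate(hap):
        expected *= allele_freq(locus, allele)
    return expected


for hap in product("FS", repeat=3):
    print(hap, observed[hap], round(expected_count(hap), 1))
```

Working from the unrounded allele frequencies, as here, is what yields 146.8 for the first chromosome type; multiplying the three-decimal frequencies instead gives a value that differs in the first decimal place.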

The Southern Blot Procedure 

Like polypeptides, DNA fragments can be separated by electrophoresis. 
Unlike a polypeptide, which has a predetermined size according to the num- 
ber of amino acids it contains, a molecule of chromosomal DNA is random- 
ly sheared into fragments of various size during purification. Therefore, in 
any DNA preparation, the DNA fragments containing a particular sequence 
have a range of sizes depending on where on each side of the sequence the 
chromosomal DNA became sheared. Fortunately, there is a class of enzymes 
that cleaves DNA at particular sites along the molecule. Consequently, when
chromosomal DNA is cleaved with such an enzyme, each DNA fragment 
containing a particular sequence is cut at the same sites on either side and so 
will have the same length. 

The enzymes that cleave DNA at particular sites are called restriction 
enzymes. Each type of restriction enzyme cuts double-stranded DNA at all 
sites at which there is a particular nucleotide sequence called the restriction 
site of the enzyme. Examples of restriction enzymes and their restriction sites 
are shown in Figure 2.5; the cuts are made at the positions of the arrows. For 
example, the enzyme AluI cuts at sites of the four-nucleotide sequence
AGCT, and EcoRI cuts at the six-nucleotide sequence GAATTC. Most restric- 
tion enzymes used in population studies have either four-nucleotide or six- 
nucleotide restriction sites. 
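
The effect of a restriction enzyme on a linear DNA molecule can be sketched in a few lines of code. This is an illustration only: the test sequence is hypothetical, and the cut positions within each site (AG^CT for AluI, G^AATTC for EcoRI) are supplied here as assumptions, since they are shown only by the arrows in the figure.

```python
# Recognition sequences named in the text. The second element is the assumed
# distance from the start of the site to the cut on the top strand.
ENZYMES = {
    "AluI": ("AGCT", 2),     # AG^CT
    "EcoRI": ("GAATTC", 1),  # G^AATTC
}


def fragment_lengths(dna, enzyme):
    """Lengths of the top-strand fragments from a complete digest."""
    site, offset = ENZYMES[enzyme]
    cuts = []
    start = dna.find(site)
    while start != -1:
        cuts.append(start + offset)
        start = dna.find(site, start + 1)
    boundaries = [0] + cuts + [len(dna)]
    return [right - left for left, right in zip(boundaries, boundaries[1:])]


# Hypothetical 30-nucleotide sequence with two EcoRI sites and one AluI site.
dna = "TTGAATTCAAAGCTTTCCGGGAATTCAATT"
print(fragment_lengths(dna, "EcoRI"))  # [3, 18, 9]
print(fragment_lengths(dna, "AluI"))   # [12, 18]
```

Every copy of this sequence yields the same fragment lengths, which is the property that makes restriction fragments identifiable as discrete bands.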

Restriction enzyme / Restriction site (entries recoverable from the figure:
5'-GAATTC-3' and 5'-CTCGAG-3')

Figure 2.5 Restriction enzymes cleave DNA molecules at sites of specific,
short nucleotide sequences. More than 500 different restriction enzymes are
commercially available. They are essential tools in DNA analysis and gene
cloning. The cleavage site in each DNA strand is indicated by the arrow.

DNA is also unlike an enzyme in that it lacks any catalytic activity that
can be used to determine the location of a band in a gel. On the other hand,
any single strand of DNA is able to form a double-stranded molecule by
pairing with another strand having the complementary base sequence. This
pairing of complementary DNA strands is the physical basis of the most
widely used procedure for identifying DNA fragments in a gel; the procedure,
illustrated in Figure 2.6, is a Southern blot. The reagent used for
identification is a molecule of DNA called the probe, which contains the
nucleotide sequence of interest. Probe DNA is usually obtained from a gene
that has been cloned (for example, into a bacterial cell) or by amplification
with the polymerase chain reaction (described in the next section). In the
Southern procedure, DNA restriction fragments that have been separated by
electrophoresis are rendered single-stranded by soaking in a solution of
sodium hydroxide, then blotted onto a nitrocellulose or nylon filter where
subsequent chemical treatment attaches them (Figure 2.6A). The filter is then
bathed in a solution containing probe DNA that has been rendered radioactive
(part B). As the solution cools, the probe DNA strands form double-stranded
molecules with their complementary counterparts on the filter, and
careful washing removes all of the probe DNA that has remained unpaired.
The filter is sandwiched with photographic film, where radioactive
disintegrations from the bound probe result in visible bands (part C).
Alternatively, the probe may be chemically modified and the bands visualized
by fluorescence or staining.

Panels: (A) Blot. (B) Hybridize filter with radioactive probe (dark bands
not visible at this stage). (C) Photographic film exposed to filter; dark
bands appear on film.

Figure 2.6 Southern blot procedure. (A) DNA fragments separated by
electrophoresis are transferred and chemically attached to a filter. (B) The
filter is mixed with radioactive probe DNA, which sticks to homologous DNA
molecules in the filter. (C) After washing, the filter is exposed to
photographic film, which develops dark bands caused by radioactive emissions
from the probe.

Labels: DNA in chromosomes; probe DNA; DNA bands.

Figure 2.7 Restriction fragment length polymorphisms (RFLPs) result from
the presence or absence of particular restriction sites in DNA. In this
example, the DNA molecule designated A contains three restriction sites, and
the one designated a contains four. Genotypes AA, Aa, and aa each yield a
different pattern of bands in a Southern blot using the indicated probe DNA.

Genetic differences resulting in the presence or absence of restriction sites
can be identified because they change the length of characteristic restriction
fragments. An example is illustrated in Figure 2.7. The upper part of each
panel shows the location of restriction sites in the DNA molecules in a diploid
genotype. The a-type molecule contains one additional restriction site not
present in the A-type molecule. The lower part of the figure demonstrates
that, with suitable probe DNA, all three genotypes can be distinguished by
their pattern of restriction fragments. A difference in the length of a
restriction fragment found segregating in natural populations is called a
restriction fragment length polymorphism or RFLP. Because RFLPs are widely
distributed throughout the genome of human beings and other organisms, they
have assumed major importance in population genetics.
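
The logic by which one extra restriction site produces three distinguishable genotypes can be sketched as follows. The coordinates and probe position are hypothetical, chosen only to agree with the description of an A-type molecule with three sites and an a-type molecule with four; the actual positions in the figure are not given in the text.

```python
# Hypothetical restriction-site coordinates (in kb) for the two molecule types.
SITES = {"A": [0, 4, 10], "a": [0, 4, 7, 10]}
PROBE = (5, 6)  # hypothetical interval (in kb) covered by the probe DNA


def detected_fragments(sites, probe):
    """Lengths of the restriction fragments that overlap the probe."""
    lengths = set()
    for left, right in zip(sites, sites[1:]):
        if left < probe[1] and right > probe[0]:
            lengths.add(right - left)
    return lengths


def band_pattern(genotype):
    """Sorted band lengths expected in a Southern blot of a diploid genotype."""
    bands = set()
    for allele in genotype:
        bands |= detected_fragments(SITES[allele], PROBE)
    return sorted(bands)


for genotype in ("AA", "Aa", "aa"):
    print(genotype, band_pattern(genotype))
# AA [6]
# Aa [3, 6]
# aa [3]
```

Only fragments that overlap the probe appear as bands, so the heterozygote shows both fragment lengths while each homozygote shows one.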

The Polymerase Chain Reaction 

The polymerase chain reaction (PCR) for the amplification of specific DNA
sequences is of great utility in population genetics for the production of
probe DNA or for the direct determination of the amount of nucleotide
sequence variation present in natural populations. The method is outlined in
Figure 2.8. The original DNA sequence to be amplified is shown in black and
the newly synthesized DNA strands in gray. The small ovals represent
synthetic oligonucleotides that are complementary in sequence to the ends of
the region to be amplified. The oligonucleotides are called primer sequences
because they anneal to the ends of the sequence to be amplified and are used
as primers for chain elongation by DNA polymerase. Primer oligonucleotides
are typically 18-22 nucleotides in length. DNA to be used as the
template in a PCR reaction is first mixed with both primers along with a
thermostable DNA polymerase in a buffer solution. The PCR amplification takes
place in cycles. In the first cycle, the DNA is heated to separate the strands
and then cooled in the presence of a vast excess of the primer
oligonucleotides. Then elongation of the primers produces double-stranded
molecules. The second cycle of PCR is similar to the first but, after the
second cycle, there are four copies of each original molecule. The cycle is
repeated from 20 to 30 times, each resulting in a doubling of the number of
molecules. The theoretical result of n rounds of amplification is 2^n copies
of each template molecule originally present.
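
The scale of the amplification is easy to underestimate. A minimal sketch of the theoretical doubling, assuming every cycle is perfectly efficient (real reactions fall short of this), is:

```python
def pcr_copies(templates, cycles):
    """Theoretical copy number after `cycles` rounds of perfect doubling."""
    return templates * 2 ** cycles


for cycles in (1, 2, 20, 30):
    print(cycles, pcr_copies(1, cycles))
# 1 2
# 2 4
# 20 1048576
# 30 1073741824
```

A single template molecule therefore yields about a million copies after 20 cycles and about a billion after 30.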


Panels: DNA duplex to be amplified; first cycle; second cycle; third cycle;
nth cycle (2^n copies).

Figure 2.8 The polymerase chain reaction (PCR). Short primer
oligonucleotides are used as primers to initiate DNA replication from
opposite ends of a DNA duplex to be amplified. After each round of
replication, the DNA is heated to separate the strands and then cooled to
allow new primers to anneal. Repeated rounds of replication result in an
exponential increase in the number of target molecules.

PCR amplification is very useful in generating large quantities of a
specific DNA sequence without the need for cloning. The main limitation of the
technique is that the DNA sequences at the ends of the region to be amplified
must be known so that primer oligonucleotides can be synthesized. There are
many applications in which this requirement is met. In population genetics,
for example, PCR can be used to amplify different alleles present in natural
populations.

PROBLEM 2.4 PCR was used to amplify five alleles (designated f-j)
of the gene Rh3 coding for a light-sensitive protein in the eye of
Drosophila simulans, a species of fruit fly closely related to D.
melanogaster. The resulting DNA fragments were sequenced (Ayala et
al. 1993). The data show the nucleotide present at each of 16
polymorphic nucleotide sites found in the first 500 nucleotide sites in the
amino acid coding region of the gene; the remaining 484 nucleotide
sites were monomorphic in this sample. Any nucleotide site that is an
exact multiple of three is at the third position of a codon. In this region
of the gene:

(a) What proportion of polymorphic nucleotide sites are in third
positions of codons? What can you infer from this observation?

(b) What proportion of nucleotide sites are polymorphic?

(c) Why is the standard error formula not appropriate for the estimate
in part (b)?

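Table: nucleotide present in each allele (rows) at each polymorphic
nucleotide site in the gene (columns: sites 132, 142, 162, 192, 198, 201,
207, 240, 246, 351, 354, 372, 375, 405, 417, and 483). The individual
entries are not legible in this copy.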

ANSWER (a) Among the 16 polymorphic sites, only site 142 is not
an exact multiple of three; hence 15/16 = 94% of the polymorphic
sites are in the third codon position. The inference is that many of the
nucleotide polymorphisms are silent (synonymous) in that they do
not alter the amino acid sequence of the polypeptide. (In fact, all 16
polymorphisms are silent, including the C → T change in 142, which
changes the codon from CUA → UUA, both of which code for leucine.)
(b) A total of 16/500 = 3.2% of the nucleotide sites are polymorphic in
this region of the gene. (c) The binomial standard error is not
appropriate in this case because the nucleotides within a gene are not
independent samples; they are genetically closely linked. (A suitable
estimate of the standard error is given later in this chapter.)
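
Parts (a) and (b) can be verified directly from the site numbers. In the sketch below, the list of polymorphic sites is the one quoted later in this chapter for the same data, and the helper name `codon_position` is illustrative; sites are assumed to be numbered from the first nucleotide of the coding region, as the problem implies.

```python
# The 16 polymorphic sites in the first 500 coding nucleotides of Rh3.
polymorphic_sites = [132, 142, 162, 192, 198, 201, 207, 240,
                     246, 351, 354, 372, 375, 405, 417, 483]
region_length = 500


def codon_position(site):
    """Position within its codon (1, 2, or 3) of a 1-based coding site."""
    return (site - 1) % 3 + 1


third = [s for s in polymorphic_sites if codon_position(s) == 3]
others = [s for s in polymorphic_sites if codon_position(s) != 3]

print(len(third) / len(polymorphic_sites))    # 0.9375, about 94%
print(others, codon_position(others[0]))      # [142] 1
print(len(polymorphic_sites) / region_length)  # 0.032
```

Site 142 falls in the first position of its codon, which agrees with the CUA → UUA change described in the answer.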


Monomorphism or polymorphism of a gene in a sample is usually of interest
only insofar as it indicates monomorphism or polymorphism of the gene in the
population as a whole. In a population, a polymorphic gene is one for which
the most common allele has a frequency of less than 0.95 (some authors prefer
a more stringent cutoff at 0.99). Conversely, a monomorphic gene is one that is
not polymorphic. The cutoff at 0.95 (sometimes 0.99) in the definition of
polymorphism is arbitrary, but it serves to focus attention on those genes in
which allelic variation is common. In any large population, rare alleles are
observed for virtually every gene. An allele is considered a rare allele if
its frequency is less than 0.005; in human beings, between one and two people
per thousand are heterozygous for rare alleles of any gene. Many rare alleles
are deleterious and are presumably maintained in the population by recurrent
mutation. The definition of polymorphism is an attempt to focus on genes that
have alleles with frequencies too high to be explained solely by recurrent
mutation to harmful alleles. With the 0.95 definition of polymorphism given
above, and if alleles are combined at random into genotypes, then at least
9.5% of the population is heterozygous for the most common allele (because
2 × 0.95 × 0.05 = 0.095).

Allozyme Polymorphisms 

Polymorphism of alleles that determine allozymes is extremely widespread. 
Figure 2.9 summarizes the results of electrophoretic surveys of 14 to 71 
(mostly around 20) genes in populations of 243 species. Each point in the fig- 
ure (except that for human beings) gives the type of organism studied and 
the number of species examined. The axis labeled Polymorphism refers to 
the estimated proportion of genes that are polymorphic by the 0.95 criterion. 
The axis labeled Heterozygosity refers to the average heterozygosity in each 
group. The average heterozygosity is the estimated proportion of genes 
expected to be heterozygous in an average organism; it is estimated as the 
proportion of heterozygous genotypes for each gene averaged over all genes. 
For example, the data for Europeans include an English population in which
10 enzyme genes were examined (Harris 1966). Of the 10 genes, three were
found to be polymorphic, from which the estimated proportion of polymorphic
genes in the genome is 3/10 = 0.3. The observed proportion of heterozygous
genotypes for each of the three polymorphic genes was 0.509 (for
red-cell acid phosphatase), 0.385 (for phosphoglucomutase), and 0.095 (for
adenylate kinase); the average heterozygosity in this sample, taking into
account the additional seven genes for which the observed heterozygosity
was 0, is therefore (0.509 + 0.385 + 0.095 + 7 × 0)/10 = 0.099.
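
The same calculation, written as a short sketch with the figures quoted above for the English sample (the variable names are illustrative):

```python
# Observed heterozygosity at each of the 10 enzyme genes surveyed:
# three polymorphic genes plus seven genes with no heterozygotes observed.
observed_heterozygosity = [0.509, 0.385, 0.095] + [0.0] * 7
number_polymorphic = 3

genes = len(observed_heterozygosity)
proportion_polymorphic = number_polymorphic / genes
average_heterozygosity = sum(observed_heterozygosity) / genes

print(proportion_polymorphic)            # 0.3
print(round(average_heterozygosity, 3))  # 0.099
```

The monomorphic genes contribute zeros to the sum but still count in the denominator, which is why the average is much lower than the heterozygosity of any one polymorphic gene except adenylate kinase.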

The vertical and horizontal bars on the point corresponding to Drosophila
indicate the size of the standard error of the estimate. Therefore, the bars
indicate the limits of polymorphism and heterozygosity within which about
68% of the species are expected to fall. Among Drosophila species,
approximately 68% have a proportion of polymorphic genes in the range
0.30-0.56 and an average heterozygosity in the range 0.09-0.19. Such bars
could be attached to each point; their lengths would be comparable to those
for Drosophila, indicating substantial variability in polymorphism and
heterozygosity among species within groups.

Axes: Polymorphism and Heterozygosity. Points plotted (number of species):
human beings (Europeans, 71 loci); insects excluding Drosophila (23);
Drosophila (43); reptiles (17); birds (7); mammals (46); amphibians (13);
invertebrates excluding insects (27); plants (15); all vertebrates (135);
all invertebrates (93).

Figure 2.9 Estimated levels of heterozygosity and proportion of polymorphic
genes derived from allozyme studies of various groups of plants and animals.
The number of species studied is shown in parentheses beside each point.
Squares denote averages for plants, invertebrates, and vertebrates. The bars
across the Drosophila point indicate the standard error within which about 68%
of the species are expected to fall. Other groups have similarly large standard
errors. (Data from Nevo 1978.)

Figure 2.9 has no simple summary because of the immense variability in
polymorphism and heterozygosity found within each group of organisms (as
indicated by the length of the variability bars corresponding to Drosophila).
On the whole, there is a positive relationship between amount of
polymorphism and degree of heterozygosity. This relationship is as expected
because the greater the fraction of polymorphic genes in a population, the
more genes that are expected to be heterozygous on the average. The overall
mean polymorphism in Figure 2.9 is 0.26 ± 0.15, and the mean heterozygosity
is 0.07 ± 0.05. Vertebrates have the lowest average amount of genetic
variation among the groups in Figure 2.9, plants come next, and invertebrates
have the highest. Drosophila is the most genetically variable group of higher
organisms so far studied, and mammals the least variable. Human beings are
fairly typical of large mammals: An extensive electrophoretic survey of 104
genes in a sample including all major human races gave estimates of
polymorphism of 0.32 and heterozygosity of 0.06 (Harris et al. 1977). The one
obvious conclusion that can be reached from Figure 2.9 is that allozyme
polymorphisms are widespread among higher organisms. Genetic variation is
even more prevalent among some prokaryotes. For example, natural isolates of
the mammalian intestinal bacterium Escherichia coli exhibit levels of genetic
polymorphism two or three times greater than those of vertebrates (Selander et al.



Although genetic polymorphisms are widespread, they are not universal.
For example, both major subspecies of the cheetah Acinonyx jubatus are
virtually monomorphic (O'Brien et al. 1987). A survey of 49 enzymes among 30
animals from the East African subspecies (A. j. raineyi) yielded only two
polymorphic genes and estimates of polymorphism of 0.04 and heterozygosity of
0.01; among 98 animals from the South African subspecies (A. j. jubatus), the
estimate of polymorphism was 0.02 and that of heterozygosity 0.0004. Most
unusual was the finding of skin-graft acceptance between unrelated cheetahs
from the South African subspecies. Graft acceptance means that the cheetah
population is monomorphic for the major histocompatibility locus, which is
abundantly polymorphic in other mammals. Apparently, the cheetah, which
was worldwide in its range at one time but presently numbers fewer than
20,000 animals, underwent at least two severe constrictions in population
number resulting in the loss of most of its genetic variability.

How Representative Are Allozymes?

The generality of estimates of polymorphism based on electrophoresis is
somewhat uncertain. The amount of polymorphism may be underestimated
because conventional electrophoresis fails to detect many amino acid
replacements. For example, in a study of 14 myoglobin proteins from various
species including cetaceans (whales, dolphins, and porpoises), no more than
eight could be distinguished by conventional electrophoresis; however, 13
could be distinguished by varying the pH value of the electrophoresis buffer
(McLellan and Inouye 1986). Some amino acid replacements can be detected
because they render the enzyme sensitive to high temperatures; a test for
temperature sensitivity increased the number of identified alleles of the gene
coding for xanthine dehydrogenase in Drosophila pseudoobscura from 6 to 37
and increased the estimate of average heterozygosity from 0.44 to 0.73 (Singh
et al. 1976). On the other hand, although more elaborate techniques reveal
additional alleles of genes known to be polymorphic, thus increasing
estimates of heterozygosity, genes classified as monomorphic by means of
routine electrophoresis tend to remain monomorphic, and so estimates of
polymorphism remain much the same as before.

Electrophoretic surveys might also overestimate the amount of
polymorphism because the enzymes typically surveyed are those found in
relatively high concentration in tissues or body fluids ("Group I enzymes")
and often lack the high substrate specificity of enzymes implicated in
central metabolic processes ("Group II enzymes"). For example, among 10
Group I and 11 Group II enzymes in Drosophila, estimates of polymorphism and
heterozygosity were 0.70 and 0.24 in the former and 0.27 and 0.04 in the
latter (Gillespie and Langley 1974). In summary, protein electrophoresis is a
convenient method for detecting polymorphisms, but it is difficult to
extrapolate from electrophoretic surveys of enzymes to the entire genome
because the enzymes may not be representative.

Polymorphisms in DNA Sequences 

One inevitable limitation of protein electrophoresis is the inability to detect
variation in a nucleotide sequence that does not alter the amino acid
sequence. A polymorphism is silent if it is present in the coding region but
does not alter the amino acid sequence; many nucleotide differences in
third-codon position are of this type. A polymorphism is noncoding if it
affects nucleotides in noncoding regions such as the upstream region, the
downstream region, or introns. Silent and noncoding polymorphisms may have
subtle effects on the organism, and the alleles may be affected by natural
selection; the polymorphic alleles are silent or noncoding only in the sense
that they all code for the same amino acid sequence. An example of extensive
silent polymorphism in Drosophila is illustrated in Figure 2.10 for alleles
of the gene coding for alcohol dehydrogenase. This gene has an
electrophoretic polymorphism that is widespread in natural populations with
two predominant alleles, slow (Adh-S) and fast (Adh-F). The molecular
difference is that, in the fourth and last exon of the gene, the codon for
amino acid number 193 in Adh-S is AAG (lysine) and in Adh-F is ACG (threonine).
The enzymes differ not only in electrophoretic mobility. The product of the
fast allele has a greater enzymatic activity and is also synthesized in greater
amount than that of the slow allele.

The data in Figure 2.10 are derived from studies of RFLPs in the Adh
region of 1533 flies isolated from 25 populations throughout eastern North
America (Berry and Kreitman 1993). A total of 113 haplotypes were
identified. A haplotype is a unique combination of genetic markers present in
a chromosome. In Figure 2.10, the haplotypes indicated with squares are Adh-F
and those with circles are Adh-S. The number inside each symbol is the
relative abundance of the haplotype (1 being the most frequent, 2 the next most
frequent, and so forth). A straight line connecting two haplotypes indicates
that they differ by a single change. Figure 2.10 includes 93 haplotypes related
to at least one other by a single change; the other 20 haplotypes observed in
the study include additional changes. The main point of the Adh example is
that natural populations contain a great abundance of different types of
nucleotide-sequence variation that does not affect amino acid sequence.
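
The rule used to draw such a network, a line between any two haplotypes that differ by a single change, can be sketched as follows. The four haplotypes and their marker scores are hypothetical and are not taken from the Berry and Kreitman data.

```python
from itertools import combinations

# Hypothetical haplotypes scored for presence (1) or absence (0) of 4 markers.
haplotypes = {
    "h1": (1, 0, 1, 0),
    "h2": (1, 0, 1, 1),
    "h3": (1, 1, 1, 1),
    "h4": (0, 1, 0, 0),
}


def differences(x, y):
    """Number of markers at which two haplotypes differ."""
    return sum(a != b for a, b in zip(x, y))


# Connect every pair of haplotypes that differ by exactly one change.
edges = [(p, q) for p, q in combinations(haplotypes, 2)
         if differences(haplotypes[p], haplotypes[q]) == 1]

# Haplotypes with no single-change neighbor are left out of the network.
connected = {name for edge in edges for name in edge}
unconnected = sorted(set(haplotypes) - connected)

print(edges)        # [('h1', 'h2'), ('h2', 'h3')]
print(unconnected)  # ['h4']
```

In this toy example h4 plays the role of the 20 haplotypes that are omitted from the figure because they differ from every other haplotype by more than one change.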

Figure 2.10 Haplotypes of alleles in the Adh region of Drosophila
melanogaster from the East Coast of North America. Each line in the network
connects two haplotypes differing by a single molecular difference. An
additional 20 haplotypes, differing by more than one change from those in the
network, are not shown. Squares indicate the Adh-F allele, circles the Adh-S
allele. (From Berry and Kreitman 1993.)

Nucleotide Polymorphism and Nucleotide Diversity

Sequence data can be used quantitatively to estimate the level of genetic
variation at the nucleotide level. The data in Problem 2.4 are typical and so
will be used to exemplify the calculations. The level of nucleotide
polymorphism, symbolized θ, is the proportion of nucleotide sites that are
expected to be polymorphic in any sample of size 5 from this region of the
genome. The estimate equals the proportion of nucleotide polymorphism
observed in the sample, often symbolized as S, divided by

$$a_1 = \sum_{i=1}^{n-1} \frac{1}{i} \qquad (2.4)$$

where n is the size of the sample. In this case, S = 16/500 = 0.032 for a
sample of size n = 5, so that a_1 = 1/1 + 1/2 + 1/3 + 1/4 = 2.083. The
estimate of θ, per nucleotide site, is therefore

$$\hat{\theta} = \frac{S}{a_1} = \frac{0.032}{2.083} = 0.015$$


As noted in Problem 2.4, the variance of θ is not binomial because, owing
to genetic linkage, successive nucleotides cannot be regarded as realizations
of independent trials. An approximation to the variance can be derived under
the assumption that the nucleotides at a site are functionally equivalent or
invisible to natural selection; the mathematical details are beyond the scope
of this book, but the result is quite simple. The variance of θ, per nucleotide
site, is given by

$$\mathrm{Var}(\hat{\theta}) = \frac{\hat{\theta}}{k a_1} + \frac{a_2 \hat{\theta}^2}{a_1^2}$$

where a_1 is as defined in Equation 2.4, k is the number of nucleotides in each
sequence (in our example, k = 500), and a_2 is a function of the number of
alleles n in the sample, namely

$$a_2 = \sum_{i=1}^{n-1} \frac{1}{i^2}$$

For n = 2 through 10, the values of a_2 are 1, 1.25, 1.36, 1.42, 1.46, 1.49,
1.51, 1.53, 1.54. In the case at hand, n = 5 and the estimated variance of θ =
0.015/(500 × 2.083) + 1.42 × 0.015^2/2.083^2 = 9.2131 × 10^-5 (computed from
the unrounded values). The standard error of θ is the square root of the
variance or, in this case, 0.0096 per nucleotide site.

A second quantity used to assess polymorphisms at the DNA level is the
nucleotide diversity, typically denoted π, which is the average proportion of
nucleotide differences between all possible pairs of sequences in the sample.
In a sample of n sequences, there are n(n - 1)/2 pairwise comparisons. For
the data in Problem 2.4, n = 5, and so there are 10 pairwise comparisons. The
pairwise comparisons may be considered for each nucleotide in turn and the
differences averaged later. For the polymorphic sites in Problem 2.4, the
number of pairwise differences is 6 (= 2 × 3) for sites 132, 142, 246, 351,
405, and 483; it is 4 (= 1 × 4) for sites 162, 198, 201, 207, 240, 354, 372,
375, and 417; and it is 7 for site 192. Among the 484 monomorphic nucleotides
in Problem 2.4, the number of pairwise differences is 0. The average
proportion of pairwise differences between the sequences in the sample is the
estimate of the nucleotide diversity; hence,

$$\hat{\pi} = \frac{6 \times 6 + 4 \times 9 + 1 \times 7 + 0 \times 484}{10 \times 500} = 0.016$$

The variance of π is estimated as follows:

$$\mathrm{Var}(\hat{\pi}) = \frac{b_1}{k}\hat{\pi} + b_2 \hat{\pi}^2$$

where k is again the length of the sequences in nucleotides and where

$$b_1 = \frac{n + 1}{3(n - 1)}, \qquad b_2 = \frac{2(n^2 + n + 3)}{9n(n - 1)}$$

For example, when n = 5, then b_1 = 0.5 and b_2 = 0.37, and so Var(π) =
(0.5/500) × 0.016 + 0.37 × 0.016^2 = 0.000107; the standard error of π is the
square root, or 0.010.
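
The pairwise counting can also be done from the allele counts at each site. In the following sketch the counts (3, 2) and (4, 1) correspond to the products 2 × 3 and 1 × 4 given above; the counts (3, 1, 1) for site 192 are an assumption, chosen because three nucleotides present in 3, 1, and 1 sequences give the stated 7 differences.

```python
from math import sqrt


def pairwise_differences(allele_counts):
    """Number of sequence pairs that differ at a site with these allele counts."""
    n = sum(allele_counts)
    return (n * n - sum(c * c for c in allele_counts)) // 2


def pi_estimate(site_counts, n, k):
    """Average proportion of pairwise differences over k nucleotide sites."""
    comparisons = n * (n - 1) // 2
    total = sum(pairwise_differences(c) for c in site_counts)
    return total / (comparisons * k)


def pi_variance(pi, n, k):
    """Variance of the nucleotide diversity estimate."""
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    return b1 / k * pi + b2 * pi ** 2


n, k = 5, 500
# Allele counts at the 16 polymorphic sites; monomorphic sites contribute 0.
site_counts = [(3, 2)] * 6 + [(4, 1)] * 9 + [(3, 1, 1)]

pi = pi_estimate(site_counts, n, k)
variance = pi_variance(pi, n, k)

print(round(pi, 3))              # 0.016
print(round(variance, 6))        # 0.000107
print(round(sqrt(variance), 3))  # 0.01
```

Counting differences site by site in this way avoids comparing the ten pairs of sequences explicitly and gives the same total of 79 pairwise differences.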

The estimates of θ and π based on nucleotide sequences are not readily
convertible to levels of polymorphism and heterozygosity expected at the
protein level. The main reason is that most observed nucleotide
polymorphisms are either silent or noncoding and so do not change the amino
acid sequence of the polypeptide. The level of protein polymorphism is
determined to a large extent by the degree to which the amino acid sequence is
constrained by natural selection against variant sequences (or, in some cases,
by natural selection for variant sequences), and constraints at the protein
level are not generally predictable from θ and π.

On the other hand, there is a theoretical relation between θ and π that is
expected under the simplifying assumption that the alleles are invisible to
natural selection. The theoretical basis of the relation between θ and π is
discussed in connection with the neutral theory of molecular evolution in
Chapter 8, but the expected relation is that θ = π. For the data in Problem
2.4, for example, θ = 0.015; this number is to be compared with π = 0.016,
and so the agreement with expectation is quite good. (On the other hand, the
sample size is very small.)

Estimates of nucleotide polymorphism and diversity can also be carried
out with restriction-site data in the form of restriction fragment length
polymorphisms (RFLPs). The simplest way to proceed is to analyze the
restriction sites in turn. Each monomorphic restriction site is regarded as
identifying six adjacent monomorphic nucleotides (or four monomorphic
nucleotides, if the enzyme has a four-base restriction site). Each
polymorphic restriction site is regarded as identifying five monomorphic
nucleotides and one polymorphic nucleotide (or three monomorphic and one
polymorphic, if the enzyme has a four-base restriction site). In other words,
each restriction site polymorphism is supposed to result from polymorphism of
a single nucleotide in the restriction site. Pairwise comparisons to estimate
π are carried out under this assumption. The reasoning is illustrated in the
following problem.

PROBLEM 2.5 Restriction-site variation was studied around the
gene for alcohol dehydrogenase (Adh) in a population of D.
melanogaster descended from animals trapped at a Dutch fruit market
(Cross and Birley 1986). The region contained a total of
23 sites for five restriction enzymes, each having a six-base restriction
site. A total of 16 sites were cut in all flies in the sample. The
accompanying table documents the presence (+) or absence (-) of each of the
7 polymorphic sites in a sample of 10 chromosomes. Estimate the
proportion of polymorphic nucleotides θ, the nucleotide diversity π,
and the standard error of each. Does the relation θ = π seem to hold
for the estimates?

ANSWER Consider first the nucleotide polymorphisms. The 16
monomorphic sites identify 16 × 6 = 96 monomorphic nucleotides; the
7 polymorphic sites identify 7 × 5 = 35 monomorphic sites and 7 × 1 = 7
polymorphic nucleotides (assuming only 1 nucleotide is altered for
each restriction site that is lost). Altogether, there are 138 nucleotides
of which 7 are polymorphic. Because n = 10, then a_1 = 2.83 and a_2 =
1.54. The estimate of θ is therefore θ = (7/138)/2.83 = 0.0179 per
nucleotide site and Var(θ) = 0.0179/(138 × 2.83) + (1.54 ×
0.0179^2/2.83^2) = 1.0778 × 10^-4. The standard error of θ is therefore
0.0104 per nucleotide site. For estimating π, there are 10 × 9/2 = 45
pairwise comparisons, and a restriction site with i "plus" and (10 - i)
"minus" means that the polymorphic nucleotide site results in i × (10
- i) pairwise mismatches. Therefore, the total number of mismatches
for each of the restriction sites, from left to right, equals 16, 24, 9, 16, 9,
9, and 21, respectively, totaling 104. In addition, there are 16 × 6
nucleotides (from the monomorphic sites) and 7 × 5 nucleotides (from
the polymorphic sites) for which the number of pairwise mismatches
equals 0. Therefore, π = 104/(45 × 23 × 6) = 0.017. For n = 10, b_1 = 0.407
and b_2 = 0.279, and so Var(π) = 0.0001277; hence the standard error
equals 0.011. In these data, θ = 0.018 and π = 0.017, which are in very
good agreement. However, the sample size is too small to generalize
this conclusion.
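
The whole answer can be reproduced in a few lines. This sketch takes the per-site mismatch totals quoted in the answer as given and assumes, as the text does, that each polymorphic six-base site contains exactly one polymorphic nucleotide; the variable names are illustrative.

```python
from math import sqrt

n = 10                                   # chromosomes sampled
monomorphic_sites, polymorphic_sites = 16, 7
site_length = 6                          # six-base restriction sites
k = (monomorphic_sites + polymorphic_sites) * site_length  # 138 nucleotides
mismatches = [16, 24, 9, 16, 9, 9, 21]   # i * (10 - i) for each polymorphic site

# Nucleotide polymorphism (theta) and its standard error.
a1 = sum(1 / i for i in range(1, n))
a2 = sum(1 / i ** 2 for i in range(1, n))
theta = (polymorphic_sites / k) / a1
theta_se = sqrt(theta / (k * a1) + a2 * theta ** 2 / a1 ** 2)

# Nucleotide diversity (pi) and its standard error.
comparisons = n * (n - 1) // 2
pi = sum(mismatches) / (comparisons * k)
b1 = (n + 1) / (3 * (n - 1))
b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
pi_se = sqrt(b1 / k * pi + b2 * pi ** 2)

print(round(theta, 4), round(theta_se, 4))  # 0.0179 0.0104
print(round(pi, 3), round(pi_se, 3))        # 0.017 0.011
```

The two estimates differ by much less than either standard error, which is the sense in which they are in good agreement.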

Uses of Genetic Polymorphisms 

Whether studied through allozymes or nucleotide sequences, natural
genetic variation has many uses. Genetic variation provides a set of built-in
markers for the genetic study of organisms in their native habitats, including
organisms for which domestication or laboratory rearing is unfeasible or for
which conventional genetic manipulation is impossible.

Genetic polymorphisms are useful in investigating the genetic
relationships among subpopulations in a species. The principle is that alleles
are shared among subpopulations because of migration, and therefore similarity
in allele frequencies among subpopulations can be used to estimate the rate
of migration (Chapter 4). Within subpopulations, alleles are shared because
of common ancestry. For example, the Ainu people of Northern Japan have
numerous Caucasoid-like features, including their facial features, light skin,
and hairy bodies, yet their genetic polymorphisms clearly show them to be
more closely related to other Mongoloid groups (Watanabe et al. 1975).
Among the most informative alleles, the Ainu people possess the D(Chi)
allele of transferrin protein and the Di^a allele of the Diego blood group, both
of which are virtually restricted to Mongoloid populations. Conversely, the
Ainu people lack several alleles that are polymorphic in Caucasoids.

From a practical point of view, genetic polymorphisms are useful in 
human populations as genetic markers that may be genetically linked to 
harmful genes that cause disease. In kinships with a family history of the dis- 
ease, the genetic markers can be used to determine which members of the 
kindred are likely to be carriers of the harmful gene. The markers can also be 
used in early diagnosis of persons likely to be affected. RFLPs and other 
types of DNA polymorphisms that are linked to disease genes have also 
demonstrated their utility as probes for identifying recombinant DNA clones 
containing the defective genes. The nearby genetic markers enable the defec- 
tive gene and its function to be identified, thus serving as a first step in the 
search for effective treatments. 

Particularly useful in population genetics are DNA markers with a large 
number of alleles of moderate frequency. In most organisms, many regions of 
the genome have multiple alleles consisting of a short sequence of bases 
repeated in tandem. Multiple alleles result because the number of copies of 
the repeated sequence may differ from one chromosome to the next The 
genotypes are even more variable because each genotype carries two alleles 
One of the practical applications of the use of such polymorphisms is in DNA 
typing, in which the alleles in the DNA from a suspect are matched with those 
from a crime-scene sample. The examination of a sufficient number of such 
highly variable regions provides a basis for distinguishing one person from 
another because no two people (with the exception of identical twins) have 
the same genotype. Genetic variability of this sort is used in determining 
paternity as well as in criminal investigations. The experimental methods of 
DNA typing, and certain relevant issues in population genetics, are discussed
in Chapter 4. 

DNA typing has also been applied to studies of the natural mating systems
of plants and animals because, with the large number and high specificity
of DNA types, close relatives can be detected in populations. In
behavioral studies, DNA typing can determine whether organisms that perform
mutually altruistic acts are genetically related. Polymorphisms of other
types can also be informative about mating systems. For example, the
observed frequencies of genotypes can be used to estimate the amount of
self-fertilization in populations of monoecious plants or hermaphroditic
animals.

From the standpoint of evolutionary biology, sequences of genes and patterns
of polymorphism can be used to make inferences about evolutionary
history and about the evolutionary process. The sequences of macromolecules
contain within themselves a record of their evolutionary history. Organisms
with a shared ancestry usually have similar gene sequences. Conversely,
similarity in sequence can be regarded as a measure of shared ancestry. As an
index of shared ancestry, sequence similarity provides a means of inferring
the ancestral relationships among a group of organisms (molecular
phylogenetics, discussed in Chapter 8). The rates and patterns of change in
sequence within species and between closely related species also contain a
record of evolutionary forces at work. Within the past 20 years, population
genetics has gone from a data-poor field to a data-rich field, and numerous
new methods of data analysis and hypothesis testing have been developed.


We have seen that Galton and Mendel chose opposite types of traits for their 
studies of variation: Galton chose continuous traits, Mendel discrete traits. 
The choices reflected a deep difference of opinion in the manner in which 
inheritance should be studied. Galton's approach was empirical, based on 
the observed similarity between relatives such as parents and offspring.
Mendel's approach was theoretical, based on unobserved segregating factors 
that determined the patterns of inheritance. Even after the rediscovery of 
Mendel's paper in 1900, the disciples of Galton (called "biometricians") dis- 
missed its significance, claiming that the postulated Mendelian factors were 
not only irrelevant for continuous traits but also inadequate to explain the 
observed correlations between relatives. The Mendelians argued that segre- 
gation and independent assortment could explain continuous traits just as 
well as discrete traits. The acrimonious dispute between the biometricians 
and the Mendelians continued for nearly 20 years. 

The dispute abated substantially with a 1918 paper by the statistician 
Ronald Aylmer Fisher (1890-1962) entitled "The correlation between rela- 
tives on the supposition of Mendelian inheritance." Fisher examined a math- 
ematical model of multifactorial inheritance and deduced the expected 
correlations between relatives. He showed that the kinds of data available for
continuous traits were not only compatible with Mendelian inheritance but 
were also predicted by it. 

The spirit of Fisher's model is shown in Figure 2.11, which illustrates the
genetic variation expected among the progeny of a cross between genotypes
that are heterozygous for each of three unlinked genes. The alleles of the
genes are represented A/a, B/b, and C/c, and the genetic variation resulting
from segregation and independent assortment is evident in the various
degrees of shading. If we assume a trait in which each uppercase allele adds
one unit to the phenotype and in which each lowercase allele is without
effect, then the aa bb cc genotype has a phenotype of 0 and the AA BB CC
genotype has a phenotype of 6. Thus there are seven possible phenotypes
(0-6) among the progeny. The distribution of phenotypes is shown in the
bar graph in Figure 2.12. The smooth curve is the normal distribution
approximating the data, which has a mean of 3 and a variance of 1.5. In
Figure 2.11, we have assumed that all of the variation in phenotype results
from differences in genotype. If there were also random environmental factors
affecting the trait, as well as a greater number of genes, then the bars in
Figure 2.12 would become less distinct and a normal distribution approximated
even better. The result is the central limit theorem at work producing
Galton's "supreme law of unreason."

Figure 2.11 Result of segregation of three independent pairs of alleles
affecting the same trait. Each allele that is indicated by an uppercase letter
is assumed to contribute one unit to the phenotype. The phenotypes range from
0 to 6 and, in the cross between triple heterozygotes, are formed in the
proportions 1:6:15:20:15:6:1.

Fisher's model was a good deal more complex than that in Figure 2.11,
allowing for differences in the effects of alleles, differences in allele
frequency, various types of dominance relations, and the effects of random
environmental factors. The work was pathbreaking in demonstrating that
continuous variation could be explained by multiple interacting Mendelian
factors. Fisher's model was complex for its time and the paper a difficult
one. It is not clear even now what practical role Fisher's paper may have
played in ending the controversy between the biometricians and the Mendelians.
Not many people seem to have read it. On the other hand, it is the seminal
paper that marked the reconciliation of the theories of Galton and Mendel.

Figure 2.12 Distribution of phenotypes from the cross in Figure 2.11 and the
approximating normal distribution. The normal curve has mean 3 and
variance 1.5.
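The proportions 1:6:15:20:15:6:1 for the triple-heterozygote cross of
Figure 2.11, and the mean of 3 and variance of 1.5 quoted for Figure 2.12,
can be checked by enumerating the six alleles an offspring receives. The
following is only a minimal sketch under the assumptions stated in the text
(unlinked genes, equal segregation, each uppercase allele adding one unit);
the function name `phenotype_distribution` is made up for this illustration.

```python
from fractions import Fraction
from itertools import product

def phenotype_distribution(n_genes=3):
    """Distribution of the number of uppercase alleles among progeny of a
    cross between parents heterozygous at each of n_genes unlinked genes."""
    n_alleles = 2 * n_genes          # one allele per gene from each parent
    counts = {}
    # 1 = uppercase allele transmitted, 0 = lowercase allele transmitted;
    # every combination is equally likely under Mendelian segregation.
    for alleles in product((0, 1), repeat=n_alleles):
        score = sum(alleles)         # phenotype = number of uppercase alleles
        counts[score] = counts.get(score, 0) + 1
    total = 2 ** n_alleles
    return {k: Fraction(v, total) for k, v in sorted(counts.items())}

dist = phenotype_distribution(3)
ratios = [int(f * 64) for f in dist.values()]            # proportions out of 64
mean = sum(k * f for k, f in dist.items())
variance = sum((k - mean) ** 2 * f for k, f in dist.items())
print(ratios, mean, variance)
```

Calling `phenotype_distribution` with a larger `n_genes` gives more, narrower
bars, which is one way to see the approach to the normal curve described above.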


Galton examined the statistical relations between the distributions of phe- 
notypic traits in successive generations. Most of the traits he studied were 
continuous traits, like height or weight, which are measured on a quantitative
scale. Galton was very taken with the observation that the phenotypes
of many continuous traits are distributed according to the bell-shaped curve 
known as the normal distribution. The peak of the normal distribution is 
determined by the mean and the spread is determined by the variance. 
Phenotypic variation in natural populations is usually in the form of differ- 
ences in continuous traits. Most continuous traits are also multifactorial, that 
is, determined by the combined effects of multiple genetic and environmen- 
tal factors. The normal distribution is often encountered in practice because 
of the central limit theorem, which states that the limiting distribution of the 
sum of a large number of independent random quantities is normal. 

Mendel studied discrete variation, such as round versus wrinkled peas, 
resulting from segregation of the alleles of a single gene. Simple Mendelian
variation is the rule for genes and their products. Genetic variation in protein 
molecules can be identified by such techniques as protein electrophoresis. 
Proteins differing in electrophoretic mobility that are coded by alternative 
alleles of the same gene are called allozymes. Allozyme variation is wide- 
spread in most organisms. Based on electrophoretic surveys of human popu- 
lations, about 30% of all enzyme-coding genes are polymorphic (in the sense 
that the most common allele has a frequency less than 0.95), and about 7% of 
the loci are heterozygous in an average person. Plants and invertebrates have 
even higher levels of allozyme variation. Although there is wide variation 
among species, Drosophila averages about 40% polymorphic loci with an aver- 
age heterozygosity of 14%. 

Genetic variation at the DNA level can be detected with the Southern blot 
procedure, in which DNA fragments produced by a restriction enzyme are 
separated by electrophoresis and identified by hybridization with a homolo- 
gous labeled probe sequence. Polymorphisms in the length of restriction 
fragments (restriction fragment length polymorphisms) are abundant 
throughout the genome and have applications in studies of genetic linkage in 
many organisms. DNA studies are also often carried out with the polymerase 
chain reaction (PCR), in which multiple cycles of primer annealing, DNA 
replication, and strand separation are used to exponentially amplify the DNA 
sequence flanked by the oligonucleotide primers. Amplified DNA may be 
sequenced, used as a probe, or manipulated in other ways. 

Polymorphisms in nucleotide sequence are abundant in natural popula- 
tions, particularly in noncoding regions and at silent sites in coding regions 
(especially at third codon positions, in which a nucleotide substitution need
not result in an amino acid replacement). For a sample of DNA sequences, 
the amount of nucleotide polymorphism is the proportion of nucleotide sites 
occupied by two or more bases (A, T, G, C) in the sample. The nucleotide 
diversity is the average proportion of nucleotide differences between all 
sequences in the sample taken in pairwise comparison. The estimates of 
nucleotide polymorphism and nucleotide diversity are not readily compared 
with allozyme data because much of the observed sequence variation is 
either noncoding or silent. 
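The two measures just defined can be computed directly from a set of aligned
sequences. The sketch below is illustrative only: the four short sequences
are invented for the example, the function names are hypothetical, and the
sequences are assumed to be already aligned and of equal length.

```python
from itertools import combinations

def nucleotide_polymorphism(seqs):
    """Proportion of sites occupied by two or more bases in the sample."""
    length = len(seqs[0])
    # zip(*seqs) walks the alignment column by column (site by site).
    variable_sites = sum(1 for site in zip(*seqs) if len(set(site)) > 1)
    return variable_sites / length

def nucleotide_diversity(seqs):
    """Average proportion of nucleotide differences over all pairs."""
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    differences = [sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs]
    return sum(differences) / (len(pairs) * length)

# Made-up sample of four aligned sequences, 10 sites each.
sample = ["ATGCATGCAT",
          "ATGCATGCAA",
          "ATGAATGCAT",
          "ATGCATGCAT"]

print(nucleotide_polymorphism(sample))
print(nucleotide_diversity(sample))
```

In this toy sample two of the ten sites vary, and the six pairwise
comparisons differ at 1, 1, 0, 2, 1, and 1 sites, so the two measures need
not agree even for the same data.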

There is often a disconnect between molecular variation and phenotypic
variation because differences in phenotype among healthy organisms cannot
usually be attributed to differences in specific molecules. Indeed, there is a
sort of disconnect between simple Mendelian inheritance and continuous
variation because the segregation of any pair of alleles affecting a continuous
trait is obscured by the segregation of other pairs of alleles as well as by the
effects of the environment. In the early years after the rediscovery of
Mendel's paper, there was considerable controversy whether Mendelian
factors could account for the patterns of variation and correlation among
relatives noted by Galton and others. The issue was resolved theoretically by
R. A. Fisher's 1918 paper on the correlation between relatives on the
supposition of Mendelian inheritance. Closing the gap between the study of
evolution at the level of phenotypes and at the level of molecular genotypes
remains one of the major challenges in population genetics.


1. Shell widths of mussels are approximately normally distributed. If the 
mean is 70 mm and the standard deviation is 10 mm, what fraction of 
the population is smaller than 80 mm? 

2. Following Problem 1, what fraction of the population is between 80 and 
90 mm in width? 

3. Calculate the mean, variance, standard deviation, and standard error of 
the mean for the following bristle counts: 13, 14, 13, 15, 14, 15, 12, 13, 14, 

4. Measurements of body weight of a very large sample from a species of
mouse have a mean of 60 g and variance of 64 g². In one area it was
suspected that environmental contamination had reduced the size of the
mice. A sample of 100 mice from this area had a mean of 58 g and a
sample variance of 64 g². Is this sample population significantly smaller in
size than the population examined with the very large sample?

5. A standard means for using a computer to generate normally distributed 
random numbers is to take 12 uniform random numbers and add them 
up. After scaling the sum by a constant that depends on the mean and the 
variance, the result represents a sample from the normal distribution one 
wants. Why does this approach work? 

6. One statement of the central limit theorem is that the sum of indepen- 
dent, identically distributed random variables has a limiting normal dis- 
tribution. If the variables that are being added exhibit positive covariance 
in successive measures (as opposed to being independent), how would 
the sum deviate from the normal distribution predicted by the central 
limit theorem? 

7. Allozyme gels reveal a sample with 64 FF, 32 FS and 4 SS females, but 
there seem to be 40 FF males and 10 SS males with no heterozygotes. 
How do you explain these data? 

8. Many proteins exist in an active form only as dimers, with two molecules 
joined either by hydrogen bonding or even by covalent cysteine bridges. 
If an enzyme is only active as a dimer, and there is electrophoretic varia- 
tion in a population with two alleles (F and S), what do you think a het- 
erozygote would look like on a gel? What would a heterozygote look like 
if only tetramers were active? 

9. How many copies of a fragment of DNA should be present after 30 
rounds of PCR, assuming perfect efficiency? 

10. Taq polymerase does not have perfect fidelity in copying DNA sequences,
and the result is that PCR products have some variation in sequence.
Why do you suppose it is still possible to sequence PCR-amplified DNA
to obtain the true sequence? When might the errors caused by Taq
polymerase cause errors in the final sequence?

11. Many new ways of scoring DNA variation at individual nucleotide sites
are becoming available, including an oligonucleotide ligation assay,
"TaqMan," template-directed dye-terminator incorporation (TDI), and
hybridization to dense oligonucleotide arrays known as DNA chips. An
important criterion for the utility of any of these methods is that it must
be very accurate. Why is accuracy so critical?

12. Four sequences of a 1200 bp gene gave the following counts of pairwise 
differences: 4, 7, 5, 3, 6, 5. What is the estimate of nucleotide diversity for 
this sample? 

13. In forensic applications of genetics, if the DNA types from a crime scene 
and a suspect do not match, the confidence one has in the conclusion is 
much greater than if the types do match. Why? 


Organization of Genetic Variation 

Random Mating • Hardy-Weinberg Principle • Chi-square Test

Multiple Alleles • Linkage Disequilibrium


The word population has so far been used in an informal, intuitive
sense to refer to a group of organisms belonging to the same
species. Further discussion and clarification of the concept is
necessary at this time. In population genetics, the word population does not
usually refer to an entire species; it refers instead to a group of organisms of
the same species living within a sufficiently restricted geographical area that
any member can potentially mate with any other member (provided that
they are of the opposite sex). Precise definition of such a unit is difficult and
varies from species to species because of the almost universal presence of
some sort of geographical structure in species, that is, some typically
nonrandom pattern in the spatial distribution of organisms. Members of a
species are rarely distributed homogeneously in space: there is almost always
some sort of clumping or aggregation, some schooling, flocking, herding, or
colony formation. Population subdivision is often caused by environmental
patchiness, areas of favorable habitat intermixed with unfavorable areas. Such
environmental patchiness is obvious in the case of, for example, terrestrial
organisms on islands in an archipelago, but patchiness is a common feature of
most habitats: freshwater lakes have shallow and deep areas, meadows have
marshy and dry areas, forests have sunny and shady areas. Population
subdivision can also be caused by social behavior, as when wolves form packs.
Even the human population is clumped or aggregated, into towns and
cities, away from deserts and mountains.


72 Chapter 3 

The local interbreeding units of possibly large, geographically structured
populations are of some interest because it is within such local units that
adaptive evolution takes place through systematic changes in allele
frequency. Such local interbreeding units, often called local populations or
demes, are the fundamental units of population genetics. Local populations
are the actual, evolving units of a species. Unless otherwise specified (or clear
from context), the term population as used in this book means local population.
Local populations are sometimes also referred to as Mendelian populations or 


In sexual organisms, genotypes are not transmitted from one generation to 
the next. Genotypes are broken up in gamete formation by the processes of 
segregation and recombination, and they are assembled anew in each gener- 
ation in fertilization: genotypes -> gametes -> genotypes. The frequency of a 
specified genotype in a population is the genotype frequency. The formation 
of a genotype in newly fertilized eggs is determined by the opportunity for 
the relevant gametes to come together in fertilization, and the opportunity for 
gametes to come together in fertilization is determined by the matings that 
take place among organisms of reproductive age in the previous generation. 
To put the matter in a slightly different way, the genotypes of the mating 
pairs determine the genotypes of the progeny. Furthermore, there are mathe- 
matical relationships between the frequencies of mating pairs and the fre- 
quencies of progeny genotypes. Such mathematical relationships are usually 
inferred from models in which the types of matings in the population are 
specified. One of the important models in population genetics is that of ran- 
dom mating, in which mating pairs have the same frequencies as if they were 
formed by random collisions between genotypes. The chance that an organ- 
ism mates with another having a prescribed genotype is therefore equal to 
the frequency of the prescribed genotype in the population. For example, 
suppose that in some population the genotype frequencies of AA, Aa, and aa 
are 0.16, 0.48, and 0.36, respectively; if mating is random, AA males mate with 
AA, Aa, and aa females in the proportions 0.16, 0.48, and 0.36, respectively; 
these same proportions apply to the mates of Aa and aa males. 
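The numerical example above can be restated as a short calculation. This is
a minimal sketch, assuming only what the text states (each mating pair is
formed with probability equal to the product of the two genotype
frequencies); the names `genotype_freq` and `mating_pair_frequencies` are
made up for the illustration.

```python
genotype_freq = {"AA": 0.16, "Aa": 0.48, "aa": 0.36}

def mating_pair_frequencies(freq):
    """Frequency of each ordered (male, female) pair under random mating."""
    return {(male, female): freq[male] * freq[female]
            for male in freq for female in freq}

pairs = mating_pair_frequencies(genotype_freq)

# Condition on the male being AA: the proportions of his mates equal the
# genotype frequencies in the population as a whole.
mates_of_AA = {female: pairs[("AA", female)] / genotype_freq["AA"]
               for female in genotype_freq}

for female, proportion in mates_of_AA.items():
    print("AA male x", female, "female:", round(proportion, 4))
```

The same conditioning applied to Aa or aa males returns the same three
proportions, which is the defining property of random mating used here.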

Superficial appearances to the contrary, random mating is not a simple or
trivial process. One complication is that random mating depends on the trait:
mating can be random with respect to some traits but nonrandom with
respect to other traits at the same time and in the same population. For
example, it is perfectly consistent for a human population to undergo random
mating with respect to blood groups, allozyme phenotypes, restriction fragment
length polymorphisms, and many other characteristics, but at the same time
to engage in nonrandom mating with respect to other traits such as skin color
and height. A second complication is population substructure. Paradoxical as
it may seem, random mating may be observed within each of the subpopulations
constituting a larger population, but random mating may still fail to
hold in the population as a whole. (The reason for this paradox is discussed
in Chapter 4.) In spite of these and other complications, random mating plays
an important role in models in population genetics because random mating
often serves as a point of departure for considering more realistic situations.

Nonoverlapping Generations 

One of the most important mathematical models in population genetics is the
nonoverlapping generation model, in which the cycle of birth, maturation,
and death includes the death of all organisms present in each generation
before the members of the next generation mature. The nonoverlapping
generation model is diagrammed in Figure 3.1. The model applies literally only
to organisms with a very simple sort of life history, such as certain short-lived
insects or annual plants that have a short growing season. In such plants, all
members of any generation germinate at about the same time, mature together,
shed their pollen, are fertilized almost simultaneously, and die immediately
after producing the new generation. This sort of hypothetical
population, with its simple life history, is used in population genetics as a
first approximation to populations that have more complex life histories.
Although at first glance the model seems hopelessly oversimplified,
calculations of expected genotype frequencies based on the model are adequate
for many purposes. In some applications, the nonoverlapping generation model
turns out to be a useful approximation even for populations with a long and
complex life history such as human beings.

Figure 3.1 The nonoverlapping generation model (diagram showing
Generation t - 1, Generation t, and Generation t + 1). The life history of the
organism is assumed to be like that of an annual plant (or any short-lived
organism), and generations are assumed to be separated in time (discrete
generations). Although the model is simple, it provides a convenient first
approximation to populations with more complex life histories.

The Hardy-Weinberg Principle 

Genotype frequencies are determined in part by the pattern of mating. In this 
section, we consider the consequences of random mating in the model with 
nonoverlapping generations. To deduce the genotype frequencies under ran- 
dom mating, additional assumptions are needed. First, the allele frequencies 
should not change from one generation to the next because of systematic evo- 
lutionary forces, the most important of which are mutation, migration, and 
natural selection. For the moment, these evolutionary forces are assumed to 
be absent or negligibly small in magnitude. (Their effects are discussed in 
Chapters 5 and 6.) Second, the population must be large enough in size that 
the allele frequencies are not subject to change merely because of sampling 
error. Variation in allele frequency owing to sampling error in small popula- 
tions is called random genetic drift and is the subject of Chapter 7. Although 
random genetic drift is present unless the population is infinite in size, the 
magnitude of the effect on allele frequency over a small number of genera- 
tions is usually sufficiently small that the process can be ignored if popula- 
tion size is 500 or more. The qualifier "over a small number of generations" is 
important because the effects of random genetic drift are cumulative. Con- 
sidered over a sufficiently large number of generations, random genetic drift 
can be important even in populations of size 10^6 or more.
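The claim that drift is negligible over a few generations yet cumulative over
many can be illustrated numerically. The sketch below is not taken from this
section: it assumes the standard binomial-sampling (Wright-Fisher) description
of drift, in which 2N gametes are drawn at random each generation, and uses
the well-known result that the variance of allele frequency after t
generations is p0(1 - p0)[1 - (1 - 1/(2N))^t]. The function name `drift_sd`
is hypothetical.

```python
import math

def drift_sd(p0, N, t):
    """Standard deviation of allele frequency after t generations of drift,
    assuming binomial sampling of 2N gametes per generation (an assumption
    of this sketch, not a statement from the text)."""
    variance = p0 * (1 - p0) * (1 - (1 - 1 / (2 * N)) ** t)
    return math.sqrt(variance)

# Population of 500, starting allele frequency 0.5, short time spans.
for t in (1, 10):
    print("N = 500, t =", t, "sd =", round(drift_sd(0.5, 500, t), 4))

# Population of one million, but a comparably large number of generations.
print("N = 10^6, t = 10^6, sd =", round(drift_sd(0.5, 10**6, 10**6), 4))
```

Under these assumptions the spread stays small for a population of 500 over
ten generations, but becomes large for a population of 10^6 once the number
of generations is of the same order as the population size.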

Before proceeding further, it may be helpful to summarize the assump- 
tions that we are making: 

• The organism is diploid. 

• Reproduction is sexual.

• Generations are nonoverlapping. 

• The gene under consideration has two alleles. 

• The allele frequencies are identical in males and females.

• Mating is random. 

• Population size is very large (in theory, infinite). 

• Migration is negligible. 

• Mutation can be ignored. 

• Natural selection does not affect the alleles under consideration. 

Collectively, these assumptions summarize the Hardy-Weinberg model, 
named after the English mathematician G. H. Hardy (1877-1947) and the
German physiologist Wilhelm Weinberg (1862-1937), who, in 1908, indepen- 
dently formulated the model and deduced its theoretical predictions of geno- 
type frequency. 

In the Hardy-Weinberg model, the mathematical relation between the
allele frequencies and the genotype frequencies is given by

$$AA:\ p^2 \qquad Aa:\ 2pq \qquad aa:\ q^2 \qquad (3.1)$$

in which $p^2$, $2pq$, and $q^2$ are the frequencies of the genotypes AA, Aa,
and aa in zygotes of any generation, p and q are the allele frequencies of A
and a in gametes of the previous generation, and p + q = 1. The frequencies
displayed in Equation 3.1 constitute the Hardy-Weinberg principle or the
Hardy-Weinberg equilibrium (HWE).

One rationale for the Hardy-Weinberg principle displayed in Equation 3.1
is based on the outcome of repeated and independent trials. With random
mating the choices of male gamete and female gamete are independent trials,
and so pairs of gametes carrying the alleles AA, Aa, or aa are expected in
proportions given by $(p\,A + q\,a)^2 = p^2\,AA + 2pq\,Aa + q^2\,aa$. A graphical
illustration of the rationale of independent trials is shown in Figure 3.2. The
chance of two A-bearing gametes coming together is $p \times p = p^2$ and that of
two a-bearing gametes coming together is $q \times q = q^2$; for the heterozygote,
the chance is $p \times q + q \times p = 2pq$ because the female gamete could carry A
and the male gamete carry a, or the other way around.
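The independent-trials argument can be written out as a small calculation
that pairs every female gamete with every male gamete and sums the products,
in the manner of a cross-multiplication square. This is a minimal sketch; the
function names and the example value p = 0.4 are chosen only for
illustration.

```python
def hardy_weinberg(p):
    """Genotype frequencies p^2, 2pq, q^2 from the allele frequency p of A."""
    q = 1 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

def cross_multiplication_square(p):
    """Sum the products of gamete frequencies over all female x male pairs."""
    q = 1 - p
    gametes = {"A": p, "a": q}
    zygotes = {"AA": 0.0, "Aa": 0.0, "aa": 0.0}
    for female, female_freq in gametes.items():
        for male, male_freq in gametes.items():
            # Sorting puts the uppercase allele first, so "aA" is counted as "Aa".
            genotype = "".join(sorted(female + male))
            zygotes[genotype] += female_freq * male_freq
    return zygotes

p = 0.4   # example allele frequency, chosen arbitrarily
print(hardy_weinberg(p))
print(cross_multiplication_square(p))
```

The heterozygote cell is filled twice in the loop (A from the female and a
from the male, and the reverse), which is where the factor of 2 in 2pq comes
from.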

Figure 3.2 Cross-multiplication square showing Hardy-Weinberg frequencies
resulting from random mating with two alleles. Male gametes and female
gametes each carry allele A with frequency p and allele a with frequency q.
Summed frequencies in zygotes: AA: $P' = p^2$; Aa: $Q' = pq + qp = 2pq$;
aa: $R' = q^2$.

Table 3.1

                                        Frequency of zygotes (progeny)
Mating      Frequency of mating         AA        Aa        aa
            (parents)

AA x AA     P^2                         1         0         0
AA x Aa     2PQ                         1/2       1/2       0
AA x aa     2PR                         0         1         0
Aa x Aa     Q^2                         1/4       1/2       1/4
Aa x aa     2QR                         0         1/2       1/2
aa x aa     R^2                         0         0         1

Totals (next generation)                P'        Q'        R'

$$P' = P^2 + 2PQ/2 + Q^2/4 = (P + Q/2)^2 = p^2$$

$$Q' = 2PQ/2 + 2PR + Q^2/2 + 2QR/2 = 2(P + Q/2)(R + Q/2) = 2pq$$

$$R' = Q^2/4 + 2QR/2 + R^2 = (R + Q/2)^2 = q^2$$

Random Mating of Genotypes versus Random Union of Gametes 

Figure 3.2 implicitly assumes an important premise: that random mating of
genotypes is equivalent to random union of gametes. A demonstration of this
premise in the case of two alleles is outlined in Table 3.1, in which pairs of
genotypes are chosen at random to form matings. The genotype frequencies
of AA, Aa, and aa in the parental generation are written as P, Q, and R,
respectively, where P + Q + R = 1. In terms of the genotype frequencies, the
allele frequencies p of A and q of a are as follows:

$$p = (2P + Q)/2 = P + Q/2$$
$$q = (2R + Q)/2 = R + Q/2 \qquad (3.2)$$

Note that p + q = P + Q + R = 1.0; this result is a consequence of the fact
that the gene has only two alleles.

With two alleles of a gene, there are six possible types of matings. When
mating is random, these mating types take place in proportion to the
genotypic frequencies in the population, and the types of mating pairs are
given by successive terms in the expansion of $(P\,AA + Q\,Aa + R\,aa)^2$. For
example, the proportion of AA x AA matings is $P \times P = P^2$. Similarly, the
proportion of AA x Aa matings is $2 \times P \times Q$ because the mating can include
either an AA male with an Aa female (proportion $P \times Q$) or an Aa male with an
AA female (proportion $Q \times P$). The frequencies of these and the other types
of matings are given in the second column of Table 3.1.

The genotypes of the zygotes produced by the matings are given in the
last three columns of Table 3.1. The offspring frequencies follow from
Mendel's law of segregation, which states that an Aa heterozygote produces
an equal number of A-bearing and a-bearing gametes. The AA and aa
homozygotes produce only A-bearing and only a-bearing gametes,
respectively. Thus, the mating AA x aa produces all Aa zygotes, the mating
AA x Aa produces 1/2 AA and 1/2 Aa zygotes, the mating Aa x Aa produces
1/4 AA, 1/2 Aa, and 1/4 aa zygotes, and so forth.

The genotype frequencies of AA, Aa, and aa zygotes after one generation
of random mating are denoted in Table 3.1 as P', Q', and R', respectively.
These values are calculated as the sum of the cross-products shown at the
bottom of the table. The genotype frequencies simplify to $P' = p^2$,
$Q' = 2pq$, and $R' = q^2$, where p and q are the allele frequencies given in
Equation 3.2. Note that the parental genotype frequencies (P, Q, and R) were
completely arbitrary except for the requirement that P + Q + R = 1. Therefore,
the Hardy-Weinberg frequencies are attained after one generation of random
mating irrespective of the genotype frequencies in the parental generation.
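The bookkeeping of Table 3.1 can be carried out mechanically: weight the
progeny proportions of each mating type by the frequency of that mating and
add up the columns. The sketch below uses exact fractions so that the
equality with the Hardy-Weinberg frequencies holds exactly; the starting
values of P, Q, and R are arbitrary numbers invented for the example, and the
names `PROGENY` and `next_generation` are hypothetical.

```python
from fractions import Fraction as F

# Progeny proportions (AA, Aa, aa) for each unordered mating type,
# following Mendel's law of segregation.
PROGENY = {
    ("AA", "AA"): (F(1), F(0), F(0)),
    ("AA", "Aa"): (F(1, 2), F(1, 2), F(0)),
    ("AA", "aa"): (F(0), F(1), F(0)),
    ("Aa", "Aa"): (F(1, 4), F(1, 2), F(1, 4)),
    ("Aa", "aa"): (F(0), F(1, 2), F(1, 2)),
    ("aa", "aa"): (F(0), F(0), F(1)),
}

def next_generation(P, Q, R):
    """Zygote frequencies (P', Q', R') after one generation of random mating."""
    freq = {"AA": P, "Aa": Q, "aa": R}
    totals = [F(0), F(0), F(0)]
    for (g1, g2), progeny in PROGENY.items():
        # Matings between unlike genotypes can occur in two ways
        # (either genotype may be the male), hence the factor of 2.
        mating_freq = freq[g1] * freq[g2] * (1 if g1 == g2 else 2)
        for i, share in enumerate(progeny):
            totals[i] += mating_freq * share
    return tuple(totals)

P, Q, R = F(1, 2), F(1, 10), F(2, 5)   # arbitrary parental frequencies (made up)
p = P + Q / 2                          # allele frequency of A
q = R + Q / 2                          # allele frequency of a

print(next_generation(P, Q, R))
print((p * p, 2 * p * q, q * q))
```

Feeding the output back into `next_generation` returns the same three
frequencies, consistent with the statement that the Hardy-Weinberg
frequencies persist once attained.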

PROBLEM 3.1 [Problem statement and answer illegible in the source. The
legible fragments refer to the presence or absence of a restriction site in
chromosomes of D. melanogaster isolated from a population in North Carolina,
and to calculating the expected frequencies.]

PROBLEM 3.2 The genotype frequencies for two alleles, E6^F and E6^S, of the
gene coding for esterase-6 were found to be consistent with Hardy-Weinberg
proportions with allele frequencies of 0.3579 for E6^F and 0.6421 for E6^S
(Mukai et al. 1974). Assuming that all of the assumptions of the
Hardy-Weinberg model hold, particularly those pertaining to random
mating in a large population with no mutation, selection, or migration,
make a table of mating frequencies similar to Table 3.1 for the
esterase-6 alleles. Then calculate the genotype frequencies expected in
the next generation along with the corresponding allele frequencies.

ANSWER The Hardy-Weinberg frequencies among parents are FF:
0.1281; FS: 0.4596; and SS: 0.4123. Therefore, the expected frequencies
of the matings are: FF x FF (0.0164); FF x FS (0.1177); FF x SS (0.1056);
FS x FS (0.2112); FS x SS (0.3790); and SS x SS (0.1700). The expected
genotype frequencies among the zygotes are, for FF, 0.0164 + 0.1177/2
+ 0.2112/4 = 0.1281; for FS, 0.1177/2 + 0.1056 + 0.2112/2 + 0.3790/2 =
0.4596; for SS, 0.2112/4 + 0.3790/2 + 0.1700 = 0.4123; note that these
are the same as in the parental generation. The allele frequencies of F
and S are again 0.3579 and 0.6421, respectively.

PROBLEM 3.3 Use a cross-multiplication square like that in Figure
3.2 to show that, when the allele frequencies differ in male and female
parents, the Hardy-Weinberg frequencies are not attained after one
generation of random mating. Use the symbols $p_m$ and $q_m$ for the
frequencies of A and a in male gametes and the symbols $p_f$ and $q_f$ for the
frequencies of A and a in female gametes. After the first generation of
random mating, what are the genotype frequencies in male and
female zygotes? What are the allele frequencies in male and female
zygotes? What are the genotype frequencies in zygotes after the second
generation of random mating? Are these in Hardy-Weinberg
proportions?
ANSWER This problem demonstrates the principle that, with random
mating, the frequency of an allele in zygotes equals the average
of the allele frequencies in the parents. If the allele frequencies in
parents differ, then random mating results in Hardy-Weinberg proportions
only after two generations. The first generation equalizes the
allele frequencies in males and females, and the second generation
yields the Hardy-Weinberg proportions. Using the suggested symbols,
after one generation of random mating, the genotype frequencies
are AA: $p_m p_f$, Aa: $p_m q_f + q_m p_f$, and aa: $q_m q_f$. These are not
in the form $x^2$, $2x(1 - x)$, and $(1 - x)^2$ unless $p_m = p_f$ and
$q_m = q_f$. However, the allele frequencies have become equal in the sexes at
$p = p_m p_f + (p_m q_f + q_m p_f)/2 = (p_m + p_f)/2$ and $q = (q_m + q_f)/2$.
The HWE is reached in one additional generation of random mating, in which
the genotype frequencies in zygotes are $p^2$, $2pq$, and $q^2$.
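The two-generation approach to equilibrium described in this answer can be
followed numerically. This is a minimal sketch for an autosomal gene, as in
the problem; the starting values 0.2 and 0.6 are invented for the
illustration, and the function names are hypothetical.

```python
def zygotes_from_gametes(p_m, p_f):
    """Zygote genotype frequencies from random union of male and female
    gametes with allele-A frequencies p_m and p_f."""
    q_m, q_f = 1 - p_m, 1 - p_f
    return {"AA": p_m * p_f,
            "Aa": p_m * q_f + q_m * p_f,
            "aa": q_m * q_f}

def allele_freq(genotypes):
    """Frequency of allele A: all of AA plus half of Aa."""
    return genotypes["AA"] + genotypes["Aa"] / 2

p_m, p_f = 0.2, 0.6                    # made-up starting allele frequencies

gen1 = zygotes_from_gametes(p_m, p_f)  # same frequencies in both sexes
p = allele_freq(gen1)                  # equals (p_m + p_f) / 2
gen2 = zygotes_from_gametes(p, p)      # both sexes now transmit frequency p

print("generation 1:", gen1, "p =", p)
print("generation 2:", gen2)
```

With these values the first generation has an excess of heterozygotes
relative to $2pq$, and the second generation has the Hardy-Weinberg
frequencies for $p = (p_m + p_f)/2$.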

Implications of the Hardy-Weinberg Principle 

The Hardy-Weinberg principle has provided the foundation for many
theoretical and experimental investigations in population genetics. However,
the theory is far from profound, and the applicability is far from universal.
Hardy especially seems to have regarded the Hardy-Weinberg principle as
virtually self-evident. He writes, "I should have expected the very simple
point which I wish to make to have been familiar to biologists." In fact, it was
familiar to some biologists; the basic principle had been noted as early as
1903 by the Harvard geneticist William E. Castle (1867-1962). Castle's work
was little known, however, and Hardy was writing to counter an argument
put forth against Mendelism that phenotypic ratios of 3 dominant to 1
recessive should be encountered frequently in natural populations if the
mechanism of Mendelian heredity were generally applicable. The immediate
implication of the Hardy-Weinberg principle was to refute the 3 : 1 argument
by showing that the genotypic ratio of A- : aa is determined by the allele
frequencies and has no special tendency to attain one particular ratio as any

Beyond the virtue of simplicity, why would anyone want to consider a
model based on so many restrictive and seemingly incorrect assumptions?
And in what sense can such a simple model be considered fundamental?
Among several reasons, two stand out. First, the Hardy-Weinberg model is a
reference model in which there are no evolutionary forces at work other than
those imposed by the process of reproduction itself. In this sense, the model
is similar to models in mechanical physics where objects fall through the sky
without wind resistance or roll down inclined planes without friction. The
model affords a baseline for comparison with more realistic models in which
evolutionary forces can change allele frequencies. Perhaps more importantly,
the Hardy-Weinberg model separates life history into two intervals: gametes
→ zygotes and zygotes → adults. In constructing more complex and realistic
models, one can often introduce the complications into the zygotes → adults
part of the life cycle (for example, in considering the effects of migration into
the population or of differential survival among the genotypes). With all
sources of change in allele frequency accounted for in the zygotes → adults
component, the gametes → zygotes component follows from the principle
that random union of gametes results in the Hardy-Weinberg proportions
among zygotes. In other words, the Hardy-Weinberg model is fundamental
in the sense that the approach of tracking allele and genotype
frequencies through time can be generalized to more realistic situations.

One of the most important implications of the Hardy-Weinberg principle
emerges when we calculate the allele frequencies of A and a in the next
generation from the formulas for $P'$, $Q'$, and $R'$ in Table 3.1. Using the result in
Equation 3.2, the allele frequency of A among the zygotes equals
$P' + Q'/2 = p^2 + 2pq/2 = p(p + q) = p$. Likewise, the allele frequency of a
among zygotes equals $R' + Q'/2 = q^2 + 2pq/2 = q(q + p) = q$. Thus, the allele
frequencies in the next generation are exactly the same as they were the
generation before. With random mating, the allele frequencies remain the same
generation after generation. In any generation, therefore, the genotype
frequencies are $p^2$, $2pq$, and $q^2$ for AA, Aa, and aa, respectively, as given
in Equation 3.1. The constancy of allele frequency, and therefore of the
genotypic composition of the population, is the single most important
implication of the Hardy-Weinberg principle. The constancy of allele frequencies
implies that, in the absence of specific evolutionary forces to change allele
frequency, the mechanism of Mendelian inheritance, by itself, keeps the allele
frequencies constant and thus preserves genetic variation. A second item of
interest is that the Hardy-Weinberg frequencies are attained in just one
generation of random mating if the allele frequencies are the same in males and
females. This, however, is true only with nonoverlapping generations; in
populations with more complex life histories, the Hardy-Weinberg frequencies
are attained gradually over a period of several generations.
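
The constancy of allele frequency under random mating can be illustrated with a short calculation. This is a sketch under the model's own assumptions (nonoverlapping generations, equal allele frequencies in the sexes); the starting genotype frequencies are hypothetical and were chosen to be far from Hardy-Weinberg proportions.

```python
def next_generation(P, Q, R):
    """Genotype frequencies (AA, Aa, aa) after one round of random mating."""
    p = P + Q / 2          # allele frequency of A among the parents
    q = R + Q / 2          # allele frequency of a among the parents
    return p * p, 2 * p * q, q * q


# Hypothetical starting population, not in Hardy-Weinberg proportions.
freqs = (0.5, 0.2, 0.3)

history = []               # one (allele frequency of A, genotype frequencies) per generation
for generation in range(4):
    P, Q, R = freqs
    history.append((P + Q / 2, freqs))
    freqs = next_generation(P, Q, R)

for generation, (p, genotypes) in enumerate(history):
    print(generation, round(p, 4), tuple(round(x, 4) for x in genotypes))
```

The allele frequency stays at its starting value in every generation, and the genotype frequencies stop changing after the first round of random mating.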

It is important to note here that conventional statistical tests for
Hardy-Weinberg proportions (such as the $\chi^2$ test discussed below) are not very
sensitive to deviations from the expected genotype frequencies. Consequently,
the mere fact that observed genotype frequencies may happen to fit the 
Hardy-Weinberg proportions cannot be taken as evidence that all of the 
assumptions underlying the model are valid. The most that can be concluded 
is that, whatever departures from the assumptions there may be, they are 
not sufficiently large to result in deviations from HWE that are detectable 
with conventional statistical tests. 

The Hardy-Weinberg Principle in Operation 

Application of the Hardy-Weinberg principle can be illustrated with data on
the MN blood groups in a British population. In a sample of 1000 people
(Race and Sanger 1975), the observed phenotypes were 298 blood group M
(indicating genotype MM), 489 blood group MN (indicating genotype MN),
and 213 blood group N (indicating genotype NN). To determine whether
these genotype frequencies are in accord with HWE, the allele frequencies of
M and N must first be estimated. The estimated allele frequency $\hat{p}$ of M is
1085/2000 = 0.5425 and that $\hat{q}$ of N is 915/2000 = 0.4575. (For the details see
Problem 1.4 in Chapter 1.) Were the population in HWE, we would expect the
genotype frequencies of MM, MN, and NN to be $p^2$, $2pq$, and $q^2$, respectively,
where $p$ and $q$ are the allele frequencies in the underlying population from
which the sample was drawn. Because $p$ and $q$ are parameters, their true
values are unknown. However, in testing for HWE we can substitute the
estimated values to obtain the expected proportions MM: $(0.5425)^2 = 0.2943$,
MN: $2(0.5425)(0.4575) = 0.4964$, and NN: $(0.4575)^2 = 0.2093$. Because
the sample size is 1000, the expected numbers of the MM, MN, and NN
genotypes are 0.2943 x 1000 = 294.3, 0.4964 x 1000 = 496.4, and 0.2093 x 1000 =
209.3, respectively.

At this point, it is convenient to tabulate the data into three columns, the 
first giving the genotypes, the second giving the observed numbers, and the 
third giving the expected numbers: 

Genotype   Observed   Expected
MM         298        294.3
MN         489        496.4
NN         213        209.3

With the data so arrayed, it is evident that the fit between the observed 
numbers and the expected numbers, though not perfect because of chance 
statistical fluctuations in the number of each genotype that may be included 
in any given sample, is nevertheless very close. To verify this conclusion, we 
will apply a conventional statistical test to the data in order to assess
quantitatively the closeness of fit. A test commonly employed in population
genetics is called the chi-square test, which is based on the value of a number,
called $\chi^2$, calculated from the data as

$$\chi^2 = \sum \frac{(\text{obs} - \text{exp})^2}{\text{exp}} \qquad (3.3)$$

where obs refers to the observed number in any genotypic class, exp refers to
the expected number in the same genotypic class, and the $\Sigma$ sign denotes that
the values are to be summed over all genotypic classes. In the case at hand,

$$\chi^2 = \frac{(298 - 294.3)^2}{294.3} + \frac{(489 - 496.4)^2}{496.4} + \frac{(213 - 209.3)^2}{209.3} = 0.222$$

To be completely unambiguous, some statisticians prefer use of the symbol
$X^2$ for the realized value of the test statistic defined in Equation 3.3, in order
to distinguish between the test statistic and the true $\chi^2$ distribution itself. The
distinction should certainly be kept in mind, but we will not recognize it
formally with different symbols.

Associated with any $\chi^2$ value is a second number called the degrees of
freedom for that $\chi^2$. In general, the number of degrees of freedom ($df$)
associated with a $\chi^2$ value equals

$$df = (\text{number of classes of data}) - (\text{number of parameters estimated from the data}) - 1$$

In the MN example, there are three classes of data and one parameter ($p$)
estimated from the data, and so $df = 3 - 1 - 1 = 1$. Note that a degree of
freedom is not subtracted for estimating $q$ because of the relation $q = 1 - p$; that is,
once $p$ has been estimated, the estimate of $q$ is automatically fixed, and so we
deduct just the one degree of freedom corresponding to $p$.

Calculation of $\chi^2$ and its associated degrees of freedom is carried out in
order to obtain a number for assessing goodness of fit; the number is
determined from Figure 3.3. To use the chart, find the value of $\chi^2$ along the
horizontal axis, then move vertically from this value until the line for the number
of degrees of freedom is intersected, then move horizontally from the point of
intersection to the vertical axis and read the corresponding probability value
$P$. In our case, with $\chi^2 = 0.222$ and one degree of freedom, the corresponding
probability value is about $P = 0.67$. The probability associated with a
particular $\chi^2$ test has the following interpretation: it is the probability that chance
alone could produce a deviation between the observed and expected values
at least as great as the deviation actually realized. Thus, if the probability is
large, it means that chance alone could account for the deviation, and it
strengthens our confidence in the validity of the model used to obtain the
expectations (in this case, the Hardy-Weinberg model). On the other hand, if
the probability associated with the $\chi^2$ is small, it means that chance alone is
not likely to lead to a deviation as large as actually realized, and it
undermines our confidence in the validity of the model. Where exactly the cutoff
should be between a "large" probability and a "small" one is, of course, not
obvious, but there is an established guideline to follow. If the probability is
less than 0.05, then the goodness of fit is considered sufficiently poor that the
model is judged invalid for the data; alternatively, if the probability is greater
than 0.05, the fit is considered sufficiently close that the model is not rejected.
Because the probability in the MN example is 0.67, which is greater than 0.05,
we have no reason to reject the hypothesis that the genotype frequencies are
in Hardy-Weinberg proportions for this gene.
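
The whole test for the MN data can be sketched in a few lines of code. This is an illustrative calculation rather than the authors' procedure: the function name is hypothetical, the expected numbers are not rounded before use, and the probability is computed from the identity that, for exactly one degree of freedom, the upper-tail probability of $\chi^2$ equals erfc(sqrt($\chi^2$/2)). The result can therefore differ slightly from a value read off a chart.

```python
import math


def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square test of Hardy-Weinberg proportions for two codominant alleles."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)      # estimated allele frequency by gene counting
    q = 1 - p
    observed = (n_AA, n_Aa, n_aa)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # df = 3 classes - 1 estimated parameter - 1 = 1; the formula below is
    # valid only for one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p, expected, chi2, p_value


p_hat, expected, chi2, p_value = hwe_chi_square(298, 489, 213)
print(f"p = {p_hat:.4f}")
print("expected:", [round(e, 1) for e in expected])
print(f"chi-square = {chi2:.3f}, P = {p_value:.2f}")
```

Because the probability is well above 0.05, the calculation leads to the same conclusion as the chart: the hypothesis of Hardy-Weinberg proportions is not rejected.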

Figure 3.3 Graph of $\chi^2$. To use the graph, find the calculated value of $\chi^2$
along the horizontal axis, then read the probability value for the appropriate
number of degrees of freedom from the vertical axis. (From Hartl 1994.)

PROBLEM 3.4 In the Ss blood group, related to the MN system,
three phenotypes corresponding to the genotypes SS, Ss, and ss can
be identified by appropriate reagents. Among the same 1000 British
people who gave the MN data above, the observed number of each
genotype for the Ss blood groups were 99 SS, 418 Ss, and 483 ss.
Estimate the allele frequency of S ($p$) and s ($q$) and carry out a $\chi^2$ test of
goodness of fit between the observed genotype frequencies and their
Hardy-Weinberg expectations. Is there any reason to reject the
hypothesis of Hardy-Weinberg proportions for this gene?

ANSWER $p = 0.308$ and $q = 0.692$. The expected numbers of SS, Ss,
and ss are 94.86, 426.27, and 478.86, respectively. The $\chi^2 = 0.377$ with
one degree of freedom. The associated probability from Figure 3.3 is
about 0.55, so there is no reason to reject the hypothesis of HWE.

Complications of Dominance 

Dominance obscures the one-to-one relation between phenotype and
genotype, but the allele frequencies can still be estimated if one is willing to
assume HWE. For a polymorphic gene with two alleles in which one of the
alleles is dominant, only two phenotypic classes can be distinguished: the
dominant phenotype and the recessive phenotype. An example is the D allele
in the human Rh blood groups, which codes for an Rh$^+$ antigen present on
the surface of red blood cells. An alternative allele, designated d, fails to code
for the antigen. The allele D is dominant over d because both DD and Dd
genotypes produce the Rh$^+$ antigen. The genotypes DD and Dd therefore
have the Rh$^+$ phenotype and are said to be Rh positive; the dd genotype has
the phenotype Rh$^-$ and is said to be Rh negative. At the molecular level, the
Dd genotype might be expected to produce only half as much antigen as DD
because it contains only one D allele, but the phenotype is nevertheless Rh$^+$.

Among American Caucasians, the frequency of Rh$^+$ is about 85.8% and
the frequency of Rh$^-$ is about 14.2% (Mourant et al. 1976). Given only the
phenotype frequencies, the data cannot be used to calculate the genotype
frequencies because we have no way of knowing what proportion of Rh$^+$
phenotypes are DD and what proportion are Dd. However, if we are willing to
assume random mating, then the relative proportions of DD and Dd genotypes
are given by the Hardy-Weinberg principle. Assuming random mating and
HWE, the genotype frequencies are given by $p^2$, $2pq$, and $q^2$, where $p$ is the
allele frequency of D. An estimate of $q$ can therefore be obtained by setting
$q^2 = 0.142$ (the frequency of the homozygous recessive phenotype), and so
$q = \sqrt{0.142} = 0.3768$. More generally, if $R$ is the frequency of homozygous
recessive genotypes found in a sample of $n$ organisms, then $q$ and its standard
error are estimated as

$$\hat{q} = \sqrt{R} \qquad \mathrm{SE}(\hat{q}) = \sqrt{\frac{1 - R}{4n}} \qquad (3.4)$$

With $q$ estimated from Equation 3.4 as 0.3768, then $p = 1 - 0.3768 =
0.6232$, and the frequencies of DD, Dd, and dd are expected to be
$p^2 = (0.6232)^2 = 0.3884$, $2pq = 2(0.6232)(0.3768) = 0.4696$, and
$q^2 = (0.3768)^2 = 0.1420$, respectively. The proportion of Rh$^+$ people that are
actually heterozygous is therefore 0.4696/(0.4696 + 0.3884) = 54.7%. However,
when there is dominance, there is no possibility for a $\chi^2$ test of goodness of fit
to HWE because there are 0 degrees of freedom ($df = 2 - 1 - 1 = 0$). The lack
of degrees of freedom is the reason why the calculated frequencies of Rh$^+$ and
Rh$^-$ (0.3884 + 0.4696 = 0.858 and 0.142, respectively) fit the observed
frequencies exactly.
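
The estimate in Equation 3.4 is easy to sketch in code. The function name below is hypothetical, and because the text reports only percentages for the Rh example, the sample size of 1000 used in the first call is an assumption made purely so that a standard error can be shown; the second call uses the Basque counts from Problem 3.5.

```python
import math


def recessive_allele_estimate(n_recessive, n_total):
    """Estimate q, its standard error, and the share of heterozygotes among
    dominant phenotypes, assuming Hardy-Weinberg proportions."""
    R = n_recessive / n_total                    # frequency of recessive homozygotes
    q = math.sqrt(R)                             # q-hat = sqrt(R)
    se = math.sqrt((1 - R) / (4 * n_total))      # SE(q-hat) = sqrt((1 - R) / 4n)
    p = 1 - q
    het_share = 2 * p * q / (p * p + 2 * p * q)  # Dd / (DD + Dd)
    return q, se, het_share


# Rh example: 14.2% recessive phenotype; n = 1000 is an assumed sample size.
q_rh, se_rh, het_rh = recessive_allele_estimate(142, 1000)
print(f"Rh: q = {q_rh:.4f}, heterozygous share of Rh+ = {het_rh:.3f}")

# Basque sample of Problem 3.5: 170 Rh-negative among 400 people.
q_b, se_b, het_b = recessive_allele_estimate(170, 400)
print(f"Basque: q = {q_b:.2f}, SE = {se_b:.2f}, heterozygous share = {het_b:.2f}")
```

Note that the standard error depends on the sample size, whereas the point estimate of $q$ depends only on the proportion of recessive phenotypes.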

PROBLEM 3.5 The Basque people, who live in the Pyrenees mountains
between France and Spain, have one of the highest frequencies
of the d allele in the Rh system so far reported. In one study of 400
Basques, 230 were found to be Rh$^+$ and 170 Rh$^-$ (Mourant et al. 1976).
Estimate the frequencies of the D and d alleles, the genotype frequencies,
and the proportion of Rh$^+$ people who are heterozygous Dd.
What is the standard error of the estimate $\hat{q}$?

ANSWER $\hat{q} = \sqrt{170/400} = 0.65$, $\hat{p} = 0.35$, and the estimated
genotype frequencies of DD, Dd, and dd are 0.121, 0.454, and 0.425,
respectively. The proportion of Dd among Rh$^+$ phenotypes in the Basque
population is 0.454/(0.121 + 0.454) = 79%. The standard error of $\hat{q}$
equals $\sqrt{(1 - 0.425)/1600} = 0.02$.

The Hardy-Weinberg principle also finds application in studies of industrial
melanism, one of the most famous and best-studied cases of evolution in
action (Kettlewell 1973). Industrial melanism refers to the evolution of black
(melanic) color patterns in several species of moths that accompanied progressive
pollution of the environment by coal soot during the industrial revolution.
(The various color forms of the moths are known as morphs.) The evolution of
melanism has been observed in Great Britain, West Germany, Eastern Europe,
the United States, and in other heavily industrialized areas. The species that
evolve melanism are typically large moths that fly by night and rest in a sort of
cataleptic state by day, often on the trunks of trees, using their cryptic
black-and-white mottled color pattern for concealment from visually cued predators
such as hedge sparrows, redstarts, and robins (Figure 3.4). Of nearly 800
species of large moths in the British Isles, where industrial melanism has been
most intensively studied, about 100 species are industrial melanics (Bishop and
Cook 1975). The best known of these are the peppered moth (Biston betularia)
and the scalloped hazel moth (Gonodontis bidentata). In most instances, the
melanic color pattern has been found to be due to a single dominant allele.

PROBLEM 3.6 In one study of a heavily polluted area near Birmingham,
England, Kettlewell (1956) observed a frequency of 87%
melanic Biston betularia. Estimate the frequency of the dominant allele
leading to melanism in this population and the frequency of melanics
that are heterozygous.

Figure 3.4 Melanic and nonmelanic moths, showing camouflage of light moths
on light background and dark moths on dark. (Photograph by H. B. D. Kettlewell.)

ANSWER The observed frequency of homozygous recessives is $R =
0.13$, and so the frequency of the recessive allele is estimated as
$\hat{q} = \sqrt{0.13} = 0.36$. Assuming random mating, the expected frequencies of
dominant homozygotes, heterozygotes, and recessive homozygotes
are 0.41, 0.46, and 0.13, respectively. The proportion of melanics that
are heterozygous is 0.46/0.87 = 52.9%.

Frequency of Heterozygotes 

The Hardy-Weinberg principle also has important implications for the
frequency of heterozygotes carrying rare recessive alleles. The graphs in Figure
3.5 depict the frequencies of AA, Aa, and aa in a population in HWE. The
heterozygotes are most frequent when the allele frequencies are 0.5. Suppose
that the allele a is a recessive, and consider the curves as the allele frequency
of a goes toward 0. As a becomes rare, the frequencies of recessive
homozygotes and heterozygotes both decrease, but the frequency of the recessive
homozygote is much lower. As the frequency of a goes to 0, the frequency of
recessive homozygotes goes to 0 at a rate of $q^2$, whereas the frequency of
heterozygotes goes to 0 at a rate of $2pq$. The result is that the ratio of
heterozygotes to recessive homozygotes increases without limit as the recessive
allele becomes rare.

Figure 3.5 Frequencies of AA, Aa, and aa genotypes with HWE, plotted against
the frequency of the A allele and the frequency of the a allele. Note that, as
either allele becomes more rare, the frequency of homozygotes for that allele is
much lower than the frequency of heterozygotes.

To illustrate the principle, suppose $q = 0.10$; then $2pq/q^2 = 18$, meaning
that there are 18 times as many heterozygotes as recessive homozygotes. For
$q = 0.01$, to take a more extreme example, the ratio is 198; and for $q = 0.001$,
the ratio is 1998. These examples demonstrate that when a recessive allele is
rare, most genotypes containing the rare allele are heterozygous.

Quantitatively, the ratio of heterozygotes to homozygotes equals
$2pq/q^2 = 2/q - 2$ which, for small $q$, is approximately $2/q$. Consequently, the
excess of heterozygotes over homozygotes becomes progressively greater as the
recessive allele becomes more rare. To take a real example, consider cystic
fibrosis, an autosomal-recessive defect in chloride transport characterized by
abnormal glandular secretions, impaired digestion, frequent respiratory
infections, and other serious symptoms. The frequency of the homozygous
recessive genotype in newborn Caucasians is approximately 1 in 1700.
For this allele, $q = \sqrt{1/1700} = 0.024$. Assuming random mating, the
frequency of heterozygotes is estimated as $2(0.024)(1 - 0.024) = 0.047$, or about 1
in 21. In other words, although only 1 person in 1700 is actually affected with
cystic fibrosis, 1 person in 21 is a heterozygous carrier of the harmful allele.
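
The carrier calculation can be written as a short function. This is a minimal sketch assuming random mating; the function name is hypothetical, and because $q$ is not rounded before use the carrier frequency may differ in the third decimal place from the hand calculation.

```python
import math


def carrier_frequency(incidence):
    """Given the frequency of affected (recessive homozygous) newborns, return
    q, the heterozygote frequency 2pq, and the heterozygote:homozygote ratio."""
    q = math.sqrt(incidence)
    carriers = 2 * q * (1 - q)
    ratio = carriers / incidence       # algebraically equal to 2/q - 2
    return q, carriers, ratio


q, carriers, ratio = carrier_frequency(1 / 1700)
print(f"q = {q:.3f}; carriers about 1 in {1 / carriers:.0f}; "
      f"{ratio:.0f} carriers per affected individual")
```

The same function applies to any rare autosomal-recessive condition for which only the incidence of affected individuals is known.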

PROBLEM 3.7 Phenylketonuria is a defect in phenylalanine metabolism
caused by lack of a functioning allele. Over 200 defective alleles
have been identified and most affected individuals are actually
heterozygous for two different defective alleles. The condition affects
about 1 in 10,000 newborn Caucasians. Estimate the frequency of
heterozygotes for the normal and a defective allele under the assumption
of random mating.

ANSWER About 1 person in 50 carries a defective allele. 


In this section we extend the Hardy-Weinberg principle to multiple alleles 
and to genes located on the X chromosome. 

Three or More Alleles 

Genotype frequencies under random mating for genes with three alleles are
shown in Figure 3.6. Here it is convenient to label the alleles as $A_1$, $A_2$, and $A_3$
and the corresponding allele frequencies as $p_1$, $p_2$, and $p_3$. Because there are
only three alleles, $p_1 + p_2 + p_3 = 1$. With three alleles there are six diploid
genotypes, and under random mating their expected frequencies are as follows:

Genotype    Frequency
$A_1A_1$    $p_1^2$
$A_1A_2$    $2p_1p_2$
$A_2A_2$    $p_2^2$
$A_1A_3$    $2p_1p_3$
$A_2A_3$    $2p_2p_3$
$A_3A_3$    $p_3^2$

These frequencies can be obtained by expanding $(p_1A_1 + p_2A_2 + p_3A_3)^2$,
which the cross-multiplication square in Figure 3.6 does automatically.

Figure 3.6 Cross-multiplication square showing Hardy-Weinberg frequencies
for three autosomal alleles. (Male and female gametes are each labeled by allele
and allele frequency.)

Application of Figure 3.6 can be illustrated with the familiar ABO blood
groups in humans. The ABO blood groups are controlled by three alleles
designated $I^A$, $I^B$, and $I^O$. Genotypes $I^AI^A$ and $I^AI^O$ have blood type A;
genotypes $I^BI^B$ and $I^BI^O$ have blood type B; genotype $I^OI^O$ has blood type O;
and genotype $I^AI^B$ has blood type AB. In one test of 6313 Caucasians in Iowa
City, the number of people with blood types A, B, O, and AB was found to be 2625,
570, 2892, and 226, respectively (Mourant et al. 1976). The best estimates
of allele frequency in this case are $p_1 = 0.2593$ (for $I^A$), $p_2 = 0.0652$ (for $I^B$), and
$p_3 = 0.6755$ (for $I^O$). (Estimation of allele frequencies for the ABO blood groups
is complicated because of dominance; for methods see Cavalli-Sforza and
Bodmer 1971 and Vogel and Motulsky 1986.) The expected (and observed)
numbers of the four blood-type phenotypes are therefore:

A:   $(0.2593^2 + 2 \times 0.2593 \times 0.6755) \times 6313 = 2636.0$   (observed 2625)
B:   $(0.0652^2 + 2 \times 0.0652 \times 0.6755) \times 6313 = 582.9$    (observed 570)
O:   $0.6755^2 \times 6313 = 2880.6$                                     (observed 2892)
AB:  $(2 \times 0.2593 \times 0.0652) \times 6313 = 213.5$               (observed 226)

The $\chi^2$ for goodness of fit to Hardy-Weinberg proportions is 1.11. There is
one degree of freedom for this test: 4 (to start with) - 1 (for fixing the total at
6313) - 1 (for estimating $p_1$ from the data) - 1 (for estimating $p_2$ from the
data); a degree of freedom is not deducted for estimating $p_3$ because
$p_3 = 1 - p_1 - p_2$. For a $\chi^2$ of 1.11 with one degree of freedom, the associated
probability from Figure 3.3 is about 0.30, and so the Iowa City population gives no
evidence against Hardy-Weinberg proportions for this gene.
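
The ABO calculation above can be reproduced with a short script. This is a sketch that takes the allele-frequency estimates as given (it does not estimate them, which the text notes is more involved); the function name is hypothetical.

```python
def abo_expected(p_A, p_B, p_O, n):
    """Expected numbers of the four ABO phenotypes under random mating."""
    return {
        "A": (p_A ** 2 + 2 * p_A * p_O) * n,   # genotypes AA and AO
        "B": (p_B ** 2 + 2 * p_B * p_O) * n,   # genotypes BB and BO
        "O": p_O ** 2 * n,                     # genotype OO
        "AB": 2 * p_A * p_B * n,               # genotype AB
    }


observed = {"A": 2625, "B": 570, "O": 2892, "AB": 226}
expected = abo_expected(0.2593, 0.0652, 0.6755, sum(observed.values()))
chi2 = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)

for k in observed:
    print(f"{k:>2}: expected {expected[k]:7.1f}, observed {observed[k]}")
print(f"chi-square = {chi2:.2f} with 4 - 1 - 2 = 1 degree of freedom")
```

Substituting the counts and allele frequencies from Problem 3.8 into the same function gives the expected numbers asked for there.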

PROBLEM 3.8 In a sample of 1617 Spanish Basques, the numbers of
A, B, O, and AB blood types observed were 724, 110, 763, and 20,
respectively (Mourant et al. 1976). The best estimates of allele
frequency are $p_1 = 0.2661$ (for $I^A$), $p_2 = 0.0411$ (for $I^B$), and $p_3 = 0.6928$ (for
$I^O$). Calculate the expected numbers of the four phenotypes and carry
out a $\chi^2$ test for goodness of fit to the Hardy-Weinberg expectations.

ANSWER The expected numbers of A, B, O, and AB are 710.7, 94.8,
776.1, and 35.4, respectively. The $\chi^2$ equals 9.61 with one degree of
freedom, for which the corresponding probability is 0.0025. Because
a deviation as large or larger than that observed would be expected by
chance in only 0.0025 samples (that is, about 1 in 400), there is very
good reason to reject the hypothesis that the genotypes are in
Hardy-Weinberg proportions in this population. The reason for the
discrepancy is not known. One likely possibility is migration into the
population by people with allele frequencies that are significantly different
from those among the Basques themselves.

PROBLEM 3.9 Among many aboriginal American Indian tribes, the
allele frequency of $I^B$ is extremely low. For example, a sample of 600
Papago Indians from Arizona included 37 A and 563 O blood types
(Mourant et al. 1976). What are the best estimates of the allele
frequencies of $I^A$, $I^B$, and $I^O$ in this population, and what are the expected
genotype frequencies assuming random mating?

ANSWER There are no $I^B$ alleles in the sample, so the best estimate
of $p_2$ is 0. Thus, there are only two alleles, $I^A$ and $I^O$, with $I^A$ dominant.
The best estimate of $p_3$ is thus obtained from Equation 3.4 as
$\sqrt{563/600} = 0.9687$ and that of $p_1$ as $1 - p_3 = 0.0313$. The expected
genotype frequencies are $0.0313^2 = 0.0010$ for $I^AI^A$, $2(0.0313)(0.9687) =
0.0606$ for $I^AI^O$, and $0.9687^2 = 0.9384$ for $I^OI^O$.

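In general, if there are $n$ alleles $A_1, A_2, \ldots, A_n$ with respective
frequencies $p_1, p_2, \ldots, p_n$ (and $p_1 + p_2 + \cdots + p_n = 1$), then the
genotype frequencies expected under random mating are

$$
\begin{aligned}
p_i^2 &\quad \text{for } A_iA_i \text{ homozygotes} \\
2p_ip_j &\quad \text{for } A_iA_j \text{ heterozygotes}
\end{aligned}
\qquad (3.5)
$$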

Equation 3.5 may be applied to data on allozyme polymorphisms in
Drosophila persimilis in California. One sample of 108 adult flies from the Fish
Creek population included four alleles of the gene Xdh, which codes for
xanthine dehydrogenase. We may call the alleles Xdh-1, Xdh-2, Xdh-3, and
Xdh-4; their respective frequencies were estimated as $p_1 = 0.08$, $p_2 = 0.21$,
$p_3 = 0.62$, and $p_4 = 0.09$ (Prakash 1977). With four alleles, there are four possible
homozygotes (for example, Xdh-1/Xdh-1) and six possible heterozygotes (for
example, Xdh-1/Xdh-2). In a random-mating population, the frequency of any
homozygous genotype is expected to be the square of the corresponding
allele frequency. For example, the frequency of Xdh-1/Xdh-1 is expected to be
$p_1^2$. The frequency of any heterozygous genotype is expected to be two
times the product of the corresponding allele frequencies. For example, the
frequency of Xdh-1/Xdh-2 is expected to be $2p_1p_2$. The Hardy-Weinberg
frequencies for all 10 possible genotypes can be obtained by expanding the
expression (0.08 Xdh-1 + 0.21 Xdh-2 + 0.62 Xdh-3 + 0.09 Xdh-4)$^2$.
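
Expanding the squared expression by hand becomes tedious as the number of alleles grows, so the rule in Equation 3.5 can be sketched as a small function. The function name and the dictionary layout are illustrative choices, not anything prescribed by the text.

```python
from itertools import combinations_with_replacement


def hardy_weinberg(allele_freqs):
    """Map each unordered genotype (allele_i, allele_j) to its expected
    frequency under random mating: p_i^2 for homozygotes, 2 p_i p_j otherwise."""
    genotypes = {}
    for a, b in combinations_with_replacement(allele_freqs, 2):
        if a == b:
            genotypes[(a, b)] = allele_freqs[a] ** 2
        else:
            genotypes[(a, b)] = 2 * allele_freqs[a] * allele_freqs[b]
    return genotypes


xdh = {"Xdh-1": 0.08, "Xdh-2": 0.21, "Xdh-3": 0.62, "Xdh-4": 0.09}
genotypes = hardy_weinberg(xdh)

for (a, b), freq in genotypes.items():
    print(f"{a}/{b}: {freq:.4f}")
print("number of genotypes:", len(genotypes))
print("total:", round(sum(genotypes.values()), 10))
```

With $n$ alleles the function returns $n(n + 1)/2$ genotypes, and their frequencies sum to 1 whenever the allele frequencies do.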

PROBLEM 3.10 Four alleles of the gene Adh coding for alcohol 
dehydrogenase were found in a Texas population of Phlox cuspidata 
(Levin 1978). The alleles may be designated Adh-1, Adh-2, Adh-3, and 
Adh-4. Their frequencies were estimated as 0.11, 0.84, 0.01, and 0.04, 
respectively. What are the expected Hardy-Weinberg proportions of 
the 10 genotypes? 

ANSWER Adh-1/Adh-1: $0.11^2 = 0.0121$; Adh-1/Adh-2: $2(0.11)(0.84) =
0.1848$; Adh-2/Adh-2: $0.84^2 = 0.7056$; Adh-1/Adh-3: $2(0.11)(0.01) =
0.0022$; Adh-2/Adh-3: $2(0.84)(0.01) = 0.0168$; Adh-3/Adh-3: $0.01^2
= 0.0001$; Adh-1/Adh-4: $2(0.11)(0.04) = 0.0088$; Adh-2/Adh-4:
$2(0.84)(0.04) = 0.0672$; Adh-3/Adh-4: $2(0.01)(0.04) = 0.0008$;
Adh-4/Adh-4: $0.04^2 = 0.0016$. It should be pointed out that the observed genotype
frequencies were nowhere near the Hardy-Weinberg expectations
because Phlox cuspidata undergoes a substantial frequency of
self-fertilization (about 78%), which violates the assumption of random
mating. How to deal with such departures from random mating is
discussed in Chapter 4.

X-Linked Genes

An important exception to the rule that diploid organisms contain two alleles
of every gene applies to genes on the X and Y chromosomes. In mammals
and many insects, females have two copies of the X chromosome whereas
males have one X chromosome and one Y chromosome. The X and Y
chromosomes segregate, and so half the sperm from a male carry the X
chromosome and half carry the Y chromosome. Although the Y chromosome
carries very few genes other than those involved in the determination of sex and
male fertility, the X chromosome carries as full a complement of genes as
any other chromosome. Genes on the X chromosome are called X-linked
genes, and the important consequence of X linkage is that a recessive allele
on the X chromosome in a male is expressed phenotypically because the Y
chromosome lacks any compensating allele. For X-linked genes with two
alleles, therefore, there are three female genotypes (AA, Aa, and aa) but only
two male genotypes (A and a).

The consequences of random mating with two X-linked alleles are shown
in Figure 3.7, where the alleles are denoted $X^A$ and $X^a$. Note that in females,
which have two X chromosomes, the genotype frequencies are as given by
the Hardy-Weinberg principle in Equation 3.1; in males, which have only one
X chromosome, the genotype frequencies are equal to the allele frequencies.
The calculations in Figure 3.7 are valid only if the allele frequencies are
identical in eggs and sperm. When they differ, approximate equality of allele
frequencies in the sexes is usually attained for X-linked genes in a period of 10
or so generations of random mating because, in each generation, any allele
frequency in female zygotes is the average of the frequency of the allele in
male and female parents in the previous generation.

Figure 3.7 Consequences of random mating with X-linked genes. Genotype
frequencies in females equal the Hardy-Weinberg frequencies, and genotype
frequencies in males equal the allele frequencies. Summed frequencies in zygotes:

Females                 Males
$X^AX^A$   $p^2$        $X^AY$   $p$
$X^AX^a$   $2pq$        $X^aY$   $q$
$X^aX^a$   $q^2$
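
The gradual approach to equal allele frequencies in the two sexes can be sketched as a simple recursion. The rule for daughters is the averaging rule stated in the text; the rule for sons (a son's X chromosome comes from his mother, so the frequency in males equals the frequency in females of the previous generation) is standard X-linked inheritance but is an addition to the passage. The starting values are a hypothetical extreme case.

```python
def x_linked_trajectory(p_female, p_male, generations):
    """Track the frequency of an X-linked allele in females and males."""
    history = [(p_female, p_male)]
    for _ in range(generations):
        # Daughters receive one X from each parent; sons receive their X
        # from the mother only.
        p_female, p_male = (p_female + p_male) / 2, p_female
        history.append((p_female, p_male))
    return history


# Hypothetical start: allele fixed in eggs and absent from sperm.
history = x_linked_trajectory(1.0, 0.0, 10)
for generation, (pf, pm) in enumerate(history):
    print(f"{generation:2d}  females {pf:.4f}  males {pm:.4f}  difference {pf - pm:+.4f}")
```

Under this recursion the difference between the sexes halves and changes sign each generation, which is why about 10 generations suffice for approximate equality even from an extreme start.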

PROBLEM 3.11 The human Xg blood group is controlled by an
X-linked gene with two alleles, designated $Xg^a$ and $Xg$. Two phenotypes
can be distinguished by means of the appropriate antisera, Xg(a+) and
Xg(a-). $Xg^a$ is dominant to $Xg$, and so females of genotype $Xg^a/Xg^a$
and $Xg^a/Xg$ have blood type Xg(a+), whereas females of genotype
$Xg/Xg$ are phenotypically Xg(a-). Males of genotype $Xg^a$ have blood
type Xg(a+); those of genotype $Xg$ have blood type Xg(a-). In a sample
of 2082 British people, there were 967 Xg(a+) females, 667 Xg(a+)
males, 102 Xg(a-) females, and 346 Xg(a-) males (Race and Sanger
1975). The best estimates of allele frequency are $p = 0.675$ (for $Xg^a$)
and $q = 0.325$ (for $Xg$). Calculate the expected numbers in the four
phenotypic classes, assuming random-mating proportions, and carry
out a $\chi^2$ test for goodness of fit. (The number of degrees of freedom in
this case is 1: there are four degrees of freedom to start with; one must
be deducted for using the observed number of males in calculating
the expectations for males; one must be deducted for using the
observed number of females in calculating their expectations; and one
more must be deducted for estimating $p$ from the data.)

ANSWER The expected numbers of Xg(a+) and Xg(a-) males are
0.675 x 1013 = 683.8 and 0.325 x 1013 = 329.2, respectively. The expected
numbers of Xg(a+) and Xg(a-) females are $[0.675^2 + 2(0.675)(0.325)]$
x 1069 = 956.1 and $0.325^2$ x 1069 = 112.9, respectively. The $\chi^2$ equals
2.45 which, as noted above, has one degree of freedom. The associated
probability is about 0.12 (Figure 3.3), and so there is no reason to
reject the hypothesis of random-mating proportions.

One of the important features of random mating for X-linked genes is that
phenotypes resulting from a recessive allele will be more common in males
than in females. In Problem 3.11, for example, the proportion of Xg(a-) males
is 346/1013 = 34%, whereas the proportion of Xg(a-) females is only
102/1069 = 10%. There is always an excess of affected males because $q$ (which
equals the proportion of males with the recessive phenotype) will always be
greater than $q^2$ (which is the proportion of females with the recessive
phenotype). Indeed, the discrepancy grows larger as the recessive allele becomes
more rare. For example, with the X-linked "green" type of color blindness,
q = 0.05 in Western Europeans, and so the ratio of affected males to affected
females is $q/q^2 = 1/q$ = 1/0.05 = 20. In contrast, for the X-linked "red" type
of color blindness, q = 0.01 and so, in this case, the ratio of affected males to
affected females is 1/0.01 = 100.
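The ratio $q/q^2 = 1/q$ can be tabulated for the three allele frequencies mentioned in this passage. This is a small illustrative sketch in Python; the function name is an assumption introduced here, not something from the text.

```python
def affected_male_to_female_ratio(q):
    """Ratio of affected males to affected females for an X-linked
    recessive allele at frequency q under random mating."""
    males = q         # hemizygous males express the allele directly
    females = q ** 2  # females must be homozygous for the allele
    return males / females

# Xg (q = 0.325), "green" (q = 0.05) and "red" (q = 0.01) color blindness.
for q in (0.325, 0.05, 0.01):
    print(f"q = {q}: ratio = {affected_male_to_female_ratio(q):.1f}")
```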

PROBLEM 3.12 California populations of Drosophila persimilis have
two alleles of an X-linked gene coding for allozymes of
phosphoglucomutase-1 (Policansky and Zouros 1977). The alleles may be
designated $Pgm\text{-}1^A$ and $Pgm\text{-}1^B$; their estimated frequencies were 0.25 and
0.75, respectively. Assuming random-mating proportions, what are
the expected genotype frequencies in males and females?

ANSWER In males, $Pgm\text{-}1^A$ at 0.25 and $Pgm\text{-}1^B$ at 0.75. In females,
$Pgm\text{-}1^A/Pgm\text{-}1^A$ at $0.25^2$ = 0.0625; $Pgm\text{-}1^A/Pgm\text{-}1^B$ at 2(0.25)(0.75) =
0.3750; $Pgm\text{-}1^B/Pgm\text{-}1^B$ at $0.75^2$ = 0.5625.
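The same calculation applies to any X-linked gene with two alleles. The sketch below, in Python, is only an illustration under the random-mating assumption; the function name and the short allele labels "A" and "B" are placeholders introduced here.

```python
def x_linked_genotype_frequencies(p):
    """Expected genotype frequencies for an X-linked gene with two alleles,
    A at frequency p and B at frequency 1 - p, under random mating."""
    q = 1 - p
    males = {"A": p, "B": q}  # one X: genotype frequency = allele frequency
    females = {"A/A": p ** 2, "A/B": 2 * p * q, "B/B": q ** 2}  # two X: HWE
    return males, females

males, females = x_linked_genotype_frequencies(0.25)
print(males)
print(females)
```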

Before leaving the subject of X-linkage, it is necessary to point out that 
certain species— among them, birds, moths, and butterflies— have the sex- 
chromosome situation backwards. In these species, females are XY and males
XX. The consequences of random mating are the same as otherwise, except 
that the sexes are reversed. 


With random mating, the alleles of any gene are combined at random into
genotypes according to frequencies given by the Hardy-Weinberg proportions.
To be specific, imagine a gene with two alleles, call them $A_1$ and $A_2$, at
frequencies $p_1$ and $p_2$, respectively, where $p_1 + p_2 = 1$. Then the
Hardy-Weinberg principle tells us that genotypes $A_1A_1$, $A_1A_2$, and $A_2A_2$ are
expected in the proportions $p_1^2$, $2p_1p_2$, and $p_2^2$, respectively, provided
that mating is random.

Similarly, we may consider a different gene with alleles $B_1$ and $B_2$ at
frequencies $q_1$ and $q_2$, respectively, where $q_1 + q_2 = 1$. Then the
Hardy-Weinberg principle tells us again that the genotypes $B_1B_1$, $B_1B_2$, and
$B_2B_2$ are expected in the proportions $q_1^2$, $2q_1q_2$, and $q_2^2$, respectively,
provided that mating is random. Thus, the $A_1$ allele is in random association
with the $A_2$ allele, and the $B_1$ allele is in random association with the $B_2$
allele. Strange as it may seem, the alleles of the A gene may nevertheless fail
to be in random association with the alleles of the B gene. The precise meaning
of "random association" is illustrated in Figure 3.8. In this figure the squares
refer to the alleles present in gametes, not to genotypes as in earlier
diagrams. When the alleles of the genes are in random association, the
frequency of a gamete carrying any particular combination of alleles equals the
product of the frequencies of those alleles. Genes that are in random
association are said to be in a state of linkage equilibrium, and genes not in
random association are said to be in linkage disequilibrium. With linkage
equilibrium, therefore, the gametic frequencies are:

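$$
\begin{aligned}
A_1B_1 &: \; p_1 \times q_1 \\
A_1B_2 &: \; p_1 \times q_2 \\
A_2B_1 &: \; p_2 \times q_1 \\
A_2B_2 &: \; p_2 \times q_2
\end{aligned}
$$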
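As a concrete illustration of gametic frequencies under linkage equilibrium, the following Python sketch multiplies allele frequencies for every combination of alleles. The allele frequencies used (0.6/0.4 and 0.7/0.3) are hypothetical values chosen for the example, and the function name is an assumption.

```python
from itertools import product

def equilibrium_gametes(a_freqs, b_freqs):
    """Gametic frequencies under linkage equilibrium: each gamete's
    frequency is the product of the frequencies of the alleles it carries."""
    return {
        (a, b): pa * qb
        for (a, pa), (b, qb) in product(a_freqs.items(), b_freqs.items())
    }

# Hypothetical allele frequencies, for illustration only.
gametes = equilibrium_gametes({"A1": 0.6, "A2": 0.4}, {"B1": 0.7, "B2": 0.3})
for gamete, freq in gametes.items():
    print(gamete, round(freq, 4))
```

Because the function takes dictionaries, it works unchanged for genes with more than two alleles.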



With random mating and the other simplifying assumptions listed earlier 
(including a large population with no mutation, migration, or selection), link- 
age equilibrium between genes is eventually attained. However, linkage 
equilibrium is attained gradually, and the rate of approach can be very slow. 
The slow approach to linkage equilibrium stands in contrast to the attain- 
ment of HWE with alleles of a single gene, which typically requires just one 
generation (when generations are nonoverlapping) or a relatively small
number of generations (when generations are overlapping).

The rate of approach to linkage equilibrium depends on the rate of recom- 
bination in genotypes heterozygous for both genes. There are two types of 
double heterozygotes: 

$A_1B_1/A_2B_2$

$A_1B_2/A_2B_1$

In the first case, the genotype was formed by the union of an $A_1B_1$ gamete
with an $A_2B_2$ gamete. In the second case, the genotype was formed by the
union of an $A_1B_2$ gamete with an $A_2B_1$ gamete. For the moment, consider the
genotype $A_1B_1/A_2B_2$. The gametes produced by this genotype are of four
types: (1) $A_1B_1$, (2) $A_2B_2$, (3) $A_1B_2$, and (4) $A_2B_1$. Gametic types 1 and 2
are known as nonrecombinant gametes because the alleles are associated in the
same manner as in the previous generation (specifically, $A_1$ with $B_1$ and $A_2$
with $B_2$). Gametic types 3 and 4 are known as recombinant gametes because
the alleles are associated differently than in the previous generation
(specifically, $A_1$ with $B_2$ and $A_2$ with $B_1$).

Figure 3.8 Random association between two alleles of each of two genes,
showing expected gametic frequencies when the alleles are in linkage
equilibrium. [Diagram: alleles of the A gene and their frequencies along one
side, alleles of the B gene and their frequencies along the other; each cell is
one gametic type.]

Because of Mendelian segregation, the frequency of gametic type 1 equals 
that of type 2, and the frequency of gametic type 3 equals that of type 4. That 
is, the two nonrecombinant gametes are formed in equal frequencies, and the 
two recombinant gametes are formed in equal frequencies. However, the 
overall frequency of recombinant gametes (type 3 + type 4) does not neces- 
sarily equal the overall frequency of nonrecombinant gametes (type 1 + type 
2) except in special cases. The term recombination fraction, usually symbol- 
ized r, refers to the proportion of recombinant gametes produced by a double 
heterozygote. Suppose, for example, that the genotype $A_1B_1/A_2B_2$ produces
gametes $A_1B_1$, $A_2B_2$, $A_1B_2$, and $A_2B_1$ in the proportions 0.38, 0.38, 0.12, and
0.12, respectively. Then the recombination fraction between the genes is r =
0.12 + 0.12 = 0.24.

The recombination fraction between genes depends on whether they are
present on the same chromosome and, if so, on the physical distance between
them. For genes on different chromosomes, the recombination fraction is r =
0.5 because the four possible gametic types are produced in equal frequency.
For genes on the same chromosome, the recombination fraction depends on
their distance apart, because each chromosome aligns side-by-side with its
partner chromosome in meiosis and can undergo a sort of breakage and
reunion resulting in an exchange of parts between the partner chromosomes.
The closer two genes are, the less likely that a breakage and reunion takes
place in the region between the genes; the farther apart two genes are, the
more likely such an event becomes. The smallest possible recombination
fraction is r = 0, which would imply that the two genes are so close together that
a break never takes place between them. The largest possible recombination
fraction is r = 0.5, which is found when genes are very far apart on the same
chromosome or, as noted above, when they are on different chromosomes.
Genes for which the recombination fraction is less than 0.5 must necessarily
be on the same chromosome, and such genes are said to be linked.

To sum up, if the recombination fraction between the A and B genes is
denoted r, then the genotype $A_1B_1/A_2B_2$ produces the following types of gametes:

$A_1B_1$ with frequency $(1 - r)/2$
$A_2B_2$ with frequency $(1 - r)/2$
$A_1B_2$ with frequency $r/2$
$A_2B_1$ with frequency $r/2$

The situation in the $A_1B_2/A_2B_1$ genotype is much the same, but there is one
important difference. In this case, the $A_1B_1$ and $A_2B_2$ gametes are the
recombinant types, and the $A_1B_2$ and $A_2B_1$ gametes are the nonrecombinant types.
Thus, the genotype $A_1B_2/A_2B_1$ produces the following types of gametes:

$A_1B_1$ with frequency $r/2$
$A_2B_2$ with frequency $r/2$
$A_1B_2$ with frequency $(1 - r)/2$
$A_2B_1$ with frequency $(1 - r)/2$
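The two lists above differ only in which pair of gametes is nonrecombinant. A minimal Python sketch of that rule follows; the function name and the phase labels "coupling" (for $A_1B_1/A_2B_2$) and "repulsion" (for $A_1B_2/A_2B_1$) are naming choices made here for illustration and do not appear in the text.

```python
def double_heterozygote_gametes(r, phase):
    """Gamete frequencies produced by a double heterozygote.

    phase="coupling"  stands for the genotype A1B1/A2B2;
    phase="repulsion" stands for the genotype A1B2/A2B1.
    r is the recombination fraction, between 0 and 0.5.
    """
    if not 0 <= r <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    nonrecombinant = (1 - r) / 2  # each of the two parental gamete types
    recombinant = r / 2           # each of the two recombinant gamete types
    if phase == "coupling":
        return {"A1B1": nonrecombinant, "A2B2": nonrecombinant,
                "A1B2": recombinant, "A2B1": recombinant}
    if phase == "repulsion":
        return {"A1B1": recombinant, "A2B2": recombinant,
                "A1B2": nonrecombinant, "A2B1": nonrecombinant}
    raise ValueError("phase must be 'coupling' or 'repulsion'")

# The r = 0.24 example used earlier in the text.
print(double_heterozygote_gametes(0.24, "coupling"))
print(double_heterozygote_gametes(0.24, "repulsion"))
```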

PROBLEM 3.13 The genes for the human MN and Ss blood groups
discussed in Problem 3.4 are close together on the same chromosome.
Suppose that the recombination fraction between the genes is r = 0.01.
What types and frequencies of gametes would be produced by a person
of genotype MS/Ns? By a person of genotype Ms/NS?

ANSWER The MS/Ns genotype produces gametic types MS, Ns,
Ms, and NS in proportions (1 - 0.01)/2 = 0.495, (1 - 0.01)/2 = 0.495,
0.01/2 = 0.005, and 0.01/2 = 0.005, respectively. The Ms/NS genotype
produces exactly the same gametic types, but their frequencies are
0.005, 0.005, 0.495, and 0.495, respectively.

The recombination fraction between genes is important in population
genetics because it governs the rate of approach to linkage equilibrium. To be
precise, consider a population in which the actual frequencies of the
chromosome types among gametes are as follows:

$A_1B_1$: $P_{11}$
$A_1B_2$: $P_{12}$
$A_2B_1$: $P_{21}$
$A_2B_2$: $P_{22}$

where $P_{11} + P_{12} + P_{21} + P_{22} = 1$. In terms of the gametic frequencies, linkage
equilibrium is defined as the state in which $P_{11} = p_1q_1$, $P_{12} = p_1q_2$,
$P_{21} = p_2q_1$, and $P_{22} = p_2q_2$ (see Figure 3.8).

Suppose that the genes are not in linkage equilibrium. To determine how
rapidly linkage equilibrium is approached, we need to deduce the gametic
frequencies in the next generation. Consider first the $A_1B_1$ gamete. In any one
generation, a chromosome carrying $A_1B_1$ either could have undergone
recombination between the genes (an event with probability r, where r is the
recombination fraction), or could have escaped recombination between the genes
(an event with probability 1 - r). Among the chromosomes that did not
undergo recombination, the frequency of $A_1B_1$ is the same as it was in the
previous generation; among the chromosomes that did undergo recombination,
the frequency of $A_1B_1$ chromosomes is simply the frequency of $-B_1/A_1-$
genotypes in the previous generation, where the dash in place of the A and B allele
means that the identity of that particular allele is irrelevant. Because mating is
random, the overall frequency of $-B_1/A_1-$ genotypes is $p_1q_1$. Putting all the
steps in the argument together, the frequency of $A_1B_1$ in any generation, call it
$P_{11}'$, is related to the frequency $P_{11}$ in the previous generation by the equation

$$
\begin{aligned}
P_{11}' &= (1 - r)\,P_{11} && \text{[for the nonrecombinants]} \\
        &\quad + r\,p_1q_1 && \text{[for the recombinants]}
\end{aligned}
$$

Subtraction of $p_1q_1$ from both sides leads to

$$P_{11}' - p_1q_1 = (1 - r)(P_{11} - p_1q_1) \tag{3.7}$$



Equation 3.7 becomes simplified somewhat by defining D as the difference
$P_{11} - p_1q_1$. Then $D_n$ is the value of D in the nth generation, and Equation
3.7 implies that $D_n = (1 - r)D_{n-1}$. The solution of this equation is found by
successive substitution as

$$D_n = (1 - r)D_{n-1} = (1 - r)^2 D_{n-2} = \cdots = (1 - r)^n D_0 \tag{3.8}$$

where $D_0$ is the value of D in the founding population. Because 1 - r < 1,
$(1 - r)^n$ goes to zero as n becomes large, but how rapidly $(1 - r)^n$ goes to zero
depends on r; the closer r is to zero, the slower the rate. This principle is
illustrated in Figure 3.9. Recall here that r = 0.5 corresponds either to genes far
apart in the same chromosome or to genes in different chromosomes.
Because $(1 - r)^n$ goes to zero, D goes to zero, and therefore $P_{11}$ goes to $p_1q_1$
unless there are other offsetting processes. Analogous arguments hold for
gametes containing $A_1B_2$, $A_2B_1$, or $A_2B_2$, and so $P_{12}$, $P_{21}$, and $P_{22}$ go to
$p_1q_2$, $p_2q_1$, and $p_2q_2$, respectively. Thus, linkage equilibrium is attained at a
rate determined by the value of r.
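The geometric decay of D can be made concrete with a few lines of Python. This is a sketch under the assumptions of the derivation above (random mating, constant allele frequencies, no countervailing forces); the function names are illustrative, and the starting value D = 0.25 is simply the maximum possible when all allele frequencies equal 1/2.

```python
import math

def disequilibrium_after(d0, r, n):
    """Value of D after n generations of random mating,
    D_n = (1 - r)**n * D_0."""
    return (1 - r) ** n * d0

def generations_to_halve(r):
    """Number of generations needed for D to fall to half its starting
    value, obtained by solving (1 - r)**n = 1/2 for n (requires 0 < r < 1)."""
    return math.log(0.5) / math.log(1 - r)

for r in (0.5, 0.1, 0.01, 0.001):
    d10 = disequilibrium_after(0.25, r, 10)
    print(f"r = {r}: D after 10 generations = {d10:.5f}, "
          f"half-life = {generations_to_halve(r):.1f} generations")
```

Even with free recombination (r = 0.5) D is only halved each generation, which is why the approach to linkage equilibrium is gradual rather than immediate.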

Figure 3.9 Linkage disequilibrium between genes gradually disappears when
mating is random, provided there is no countervailing force building it up. The
rate of approach to linkage equilibrium depends on the recombination frequency
between the genes. The disappearance of linkage disequilibrium is gradual
even with free recombination (r = 1/2). In these examples, the frequencies of both
alleles at both loci equal 1/2, and the initial linkage disequilibrium is either at its
maximum (D = 0.25) or minimum (D = -0.25) value, given these allele
frequencies. [Plot: D against time (in generations), one curve for each frequency
of recombination, r.]

The value of D that holds for $P_{11} - p_1q_1$ also holds for the other possible
gametes, as follows:

$$
\begin{aligned}
P_{11} &= p_1q_1 + D \\
P_{12} &= p_1q_2 - D \\
P_{21} &= p_2q_1 - D \\
P_{22} &= p_2q_2 + D
\end{aligned}
$$

The quantity D is often called the linkage disequilibrium parameter. In
terms of the gametic frequencies, D can be shown to satisfy

$$D = P_{11}P_{22} - P_{12}P_{21}$$

With random mating and no countervailing forces, the value of D changes
according to Equation 3.8, and D = 0 corresponds to linkage equilibrium.
Furthermore, $P_{11}$, $P_{12}$, $P_{21}$, and $P_{22}$ must all be nonnegative and so, for any
prescribed allele frequencies $p_1$, $p_2$, $q_1$, and $q_2$, the smallest possible ($D_{\min}$) and
largest possible ($D_{\max}$) values of D are as follows:

$$
\begin{aligned}
D_{\min} &= \text{the larger of } -p_1q_1 \text{ and } -p_2q_2 \\
D_{\max} &= \text{the smaller of } p_1q_2 \text{ and } p_2q_1
\end{aligned}
\tag{3.10}
$$

In studies of linkage disequilibrium, estimation of the gametic frequencies
$P_{11}$, $P_{12}$, $P_{21}$, and $P_{22}$ usually requires complex statistical procedures rather
than straightforward chromosome-counting methods because there are 10
genotypes but usually no more than nine phenotypes. (There are 10 genotypes
because $A_1B_1/A_2B_2$ and $A_1B_2/A_2B_1$ must be distinguished.)

An example of linkage disequilibrium is found in the genes controlling
the MN and Ss blood groups in human populations. Earlier in this chapter,
we cited data from 1000 Britishers with respect to the MN blood groups and
showed that the genotypes MM, MN, and NN are in Hardy-Weinberg
proportions. In Problem 3.4, data from the same 1000 people were analyzed with
respect to the Ss blood groups, and genotypes SS, Ss, and ss were also found
to satisfy the Hardy-Weinberg proportions. In order to discuss linkage
disequilibrium between the genes, it will be convenient to use the symbols $p_1$ and
$p_2$ for the allele frequencies of M and N, respectively, and the symbols $q_1$ and
$q_2$ for the allele frequencies of S and s, respectively. The earlier analyses
yielded estimates of $p_1$ = 0.5425 and $p_2$ = 0.4575 for M and N and $q_1$ = 0.3080 and
$q_2$ = 0.6920 for S and s. Were the loci in linkage equilibrium, the gametic
frequencies would be $p_1q_1$ for MS, $p_1q_2$ for Ms, $p_2q_1$ for NS, and $p_2q_2$ for Ns.
Therefore, among the 1000 genotypes (a total of 2000 chromosomes), the expected
numbers are as shown in the third column below (the second column gives
the observed numbers):

Gamete   Observed   Expected
MS       474        0.5425 × 0.3080 × 2000 = 334.2
Ms       611        0.5425 × 0.6920 × 2000 = 750.8
NS       142        0.4575 × 0.3080 × 2000 = 281.8
Ns       773        0.4575 × 0.6920 × 2000 = 633.2

The $\chi^2$ for goodness of fit is 184.7 with one degree of freedom: 4 (classes
to start with) - 1 (because the expected numbers must sum to the observed
total) - 1 (for estimating $p_1$ from the data) - 1 (for estimating $q_1$ from the
data) = 1. The associated probability is so small as to be off the chart in Figure
3.3, and consequently it is very much less than 0.0001. This result means that
chance alone would produce a fit as poor or poorer substantially less than
one time in 10,000, and so the hypothesis that the loci are in linkage
equilibrium can confidently be rejected.
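The expected numbers and the $\chi^2$ value for the MN and Ss data can be recomputed from the four observed chromosome counts. The Python sketch below is an illustration rather than the authors' procedure; the variable names are assumptions, and the allele frequencies are obtained by counting chromosomes, which reproduces the estimates quoted in the text.

```python
# Observed numbers of the four chromosome types among 2000 chromosomes.
observed = {"MS": 474, "Ms": 611, "NS": 142, "Ns": 773}
n = sum(observed.values())

# Allele frequencies estimated by counting chromosomes.
p1 = (observed["MS"] + observed["Ms"]) / n  # frequency of M
q1 = (observed["MS"] + observed["NS"]) / n  # frequency of S
p2, q2 = 1 - p1, 1 - q1                     # frequencies of N and s

# Expected numbers if the two genes were in linkage equilibrium.
expected = {
    "MS": p1 * q1 * n,
    "Ms": p1 * q2 * n,
    "NS": p2 * q1 * n,
    "Ns": p2 * q2 * n,
}

chi2 = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)

for k in observed:
    print(f"{k}: observed {observed[k]}, expected {expected[k]:.1f}")
print(f"chi-square = {chi2:.1f} with 1 degree of freedom")
```

Small differences in the last digit relative to the printed value can arise because the text rounds the expected numbers before summing.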

To quantify the amount of linkage disequilibrium, we must estimate the
gametic frequencies $P_{11}$, $P_{12}$, $P_{21}$, and $P_{22}$:

MS: $P_{11}$ = 474/2000 = 0.2370
Ms: $P_{12}$ = 611/2000 = 0.3055
NS: $P_{21}$ = 142/2000 = 0.0710
Ns: $P_{22}$ = 773/2000 = 0.3865

Thus, D can be estimated as $D = P_{11}P_{22} - P_{12}P_{21}$ = 0.07. From Equation
3.10, $D_{\max}$ is given by $p_1q_2$ or $p_2q_1$, whichever is smaller; in this case $p_1q_2$ = 0.38
and $p_2q_1$ = 0.14, hence $D_{\max}$ = 0.14. Therefore, $D/D_{\max}$ = 0.07/0.14 = 50%,
and so we conclude that the amount of disequilibrium between the genes
controlling the MN and Ss blood groups is about 50% of its theoretical
maximum. In most local populations of sexual organisms that regularly avoid
extreme inbreeding (mating between relatives), values of D are typically zero
or close to zero (indicating linkage equilibrium) unless the genes are very
closely linked. This overall conclusion is exemplified in the following
problems.

analysis to determine whether there is linkage disequilibrium 
between E6 and EC. If there is linkage disequilibrium, what is its 
magnitude relative to the theoretical maximum (or minimum) value? 

ANSWER For the data given in Problem 2.3, the observed numbers
of the four chromosomal types $E6^F\,EC^F$, $E6^F\,EC^S$, $E6^S\,EC^F$, and $E6^S\,EC^S$
were 159, 16, 277, and 37, respectively. The estimated allele frequencies
of $E6^F$, $E6^S$, $EC^F$, and $EC^S$ are 0.3579, 0.6421, 0.8916, and 0.1084,
respectively. Assuming linkage equilibrium, the expected numbers of
the four chromosomal types are 156.0, 19.0, 280.0, and 34.0,
respectively. The $\chi^2$ value with one degree of freedom is 0.828, for which the
associated probability is about 0.4. Thus, there is no reason to reject
the hypothesis that E6 and EC are in linkage equilibrium in this
experimental population.

PROBLEM 3.15 Carry out an analysis of linkage disequilibrium for
the genes EC and Odh, using the data in Problem 2.3 (page 47). A
convenient shortcut to obtaining the $\chi^2$ value is first to calculate

$$\rho = \frac{D}{\sqrt{p_1 p_2 q_1 q_2}}$$

by substituting, on the right-hand side, the estimated values for each
of the parameters. The value of $\chi^2$ is numerically equal to $\rho^2 N$, where
N is the total number of chromosomes examined. The biological
meaning of $\rho$ is that it is the correlation between alleles present in the
same chromosome.
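The shortcut $\chi^2 = \rho^2 N$ can be checked against the direct goodness-of-fit calculation. The Python sketch below uses the MN and Ss chromosome counts from earlier in the chapter rather than the Problem 2.3 data, so that it does not anticipate the answer to this problem; the variable names are assumptions made for the illustration.

```python
import math

# MN and Ss chromosome counts cited earlier in the chapter.
counts = {"MS": 474, "Ms": 611, "NS": 142, "Ns": 773}
n = sum(counts.values())

# Gametic frequencies and the allele frequencies they imply.
P11, P12, P21, P22 = (counts[k] / n for k in ("MS", "Ms", "NS", "Ns"))
p1, q1 = P11 + P12, P11 + P21
p2, q2 = 1 - p1, 1 - q1

D = P11 * P22 - P12 * P21
rho = D / math.sqrt(p1 * p2 * q1 * q2)  # correlation between alleles
chi2_shortcut = rho ** 2 * n

# Direct calculation from observed and expected numbers, for comparison.
expected = {"MS": p1 * q1 * n, "Ms": p1 * q2 * n,
            "NS": p2 * q1 * n, "Ns": p2 * q2 * n}
chi2_direct = sum((counts[k] - expected[k]) ** 2 / expected[k] for k in counts)

print(f"D = {D:.4f}, rho = {rho:.4f}")
print(f"chi-square: shortcut {chi2_shortcut:.2f}, direct {chi2_direct:.2f}")
```

The two values agree because all four deviations from expectation equal N times D in absolute value, so the sum of squared deviations over expectations collapses algebraically to $\rho^2 N$.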

PROBLEM 3.14 In Drosophila melanogaster, the genes E6-EC-Odh
are linked in chromosome 3. The E6 and EC genes are rather loosely 
linked (r = 0.122), whereas EC and Odh are tightly linked (r = 0.002). 
The recombination fractions are those in females, as recombination 
does not take place in males of this species. Using the data from the 
experimental population given in Problem 2.3 (page 47), carry out an 

ANSWER For the data given in Problem 2.3, the observed numbers
of the chromosomal types $EC^F\,Odh^F$, $EC^F\,Odh^S$, $EC^S\,Odh^F$, and $EC^S\,Odh^S$
were 416, 20, 44, and 9, respectively. The estimated allele
frequencies of $EC^F$, $EC^S$, $Odh^F$, and $Odh^S$ are 0.8916, 0.1084, 0.9407, and
0.0593, respectively, and $D = (416 \times 9 - 20 \times 44)/489^2 = 0.0120$. Thus,
$\rho = 0.0120/(0.8916 \times 0.1084 \times 0.9407 \times 0.0593)^{1/2} = 0.1631$. Conse-


approximately $1 - 2q$, $2q$, and 0 when q is so small that $q^2$ is
approximately 0.

8. In a population undergoing random mating for a single gene with a
dominant and recessive allele, show that the allele frequency of the recessive
allele among individuals with the dominant phenotype is $q/(1 + q)$,
where q is the allele frequency of the recessive in the whole population.

9. The frequency of one form of recessive X-linked color blindness is 5% 
among European males. What is the expected frequency of this form of 
color blindness among females? What fraction of females would be het- 
erozygous carriers? 

10. For a trait due to a rare X-linked recessive gene, show that the frequency 
of carrier females is approximately equal to two times the frequency of 
affected males. 

11. What is the analogue of the Hardy-Weinberg principle for a gene with
two alleles in a tetraploid? 

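12. Given the following table of allele frequencies:

[Table not recovered from the source: rows Allele 1, Allele 2, Allele 3, and
Allele 4; the column headings and frequency values are missing.]
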
What is the proportion (P) of polymorphic genes (using the definition in 
the text)? Assuming random mating and linkage equilibrium, what is the 
average heterozygosity (H) for the set of genes? 

13. Charles Darwin could have discovered segregation had he known what
to look for, as Mendelian segregation occurred in at least one of his own
experiments. Darwin (cited in Iltis 1932) studied flower shape in the
snapdragon Antirrhinum. In a cross between a true-breeding strain with
regular (peloric) flowers and a true-breeding strain with irregular (normal)
flowers, all of the $F_1$'s were normal. Crosses of $F_1 \times F_1$ yielded 88 normal
and 37 peloric plants. Perform a $\chi^2$ test assuming a 3 : 1 ratio in the $F_2$. Is
the peloric or normal allele dominant?

14. For a mating between triple dominant/recessive heterozygotes of three 
unlinked genes, there are eight phenotypic classes among the offspring. 
What are the expected phenotypic ratios? Mendel carried out such an 
experiment and obtained the phenotypic ratio 269 : 98 : 86 : 88 : 30 : 34 : 
27 : 7 among a total of 639 progeny. (He complained that this experiment
required the most time and effort of any of his crosses.) Calculate the $\chi^2$
and associated probability. 


15. If one gene has alleles $A_1$ and $A_2$ at frequencies $p_1$ and $p_2$, and another
gene has alleles $B_1$, $B_2$, and $B_3$ at frequencies $q_1$, $q_2$, and $q_3$, what are the
expected frequencies of gametes with linkage equilibrium, assuming that
$p_1$ = 0.3, $q_1$ = 0.2, and $q_2$ = 0.3?

16. For two genes with alleles $A_1$ and $A_2$ and $B_1$ and $B_2$, respectively, [...]

a. What are the frequencies of all possible gametes assuming linkage
equilibrium?

b. What are the frequencies of all possible gametes if there is linkage
disequilibrium with D equal to 50% of its theoretical maximum?

17. Use the result in Problem 8 to show that the frequency of homozygous
recessive genotypes from dominant × dominant matings is $[q/(1 + q)]^2$,
and from dominant × recessive matings is $q/(1 + q)$. Note that the latter is
equal to the square root of the former. (These proportions are called
Snyder's ratios and were once used to test traits for simple recessive
inheritance.)

mixture, but there is substantial linkage disequilibrium between the alleles,
as shown at the bottom of the table. In the mixed population, D equals 81% of
its theoretical maximum value. The sole cause of the disequilibrium is the
differing allele frequencies in the subpopulations. Furthermore, the
considerations in Table 3.2 make no assumption that A and B are on the same
chromosome, hence linkage disequilibrium may result from population admixture
even for genes on different chromosomes. If subpopulations become
permanently mixed and undergo random mating, then Equation 3.8 implies that
the induced linkage disequilibrium is expected to decrease at the rate r per
generation, where r is the recombination fraction between the A and B genes.
For unlinked genes, r = 1/2
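The way admixture alone creates linkage disequilibrium can be sketched numerically. In the Python example below, each subpopulation is itself in linkage equilibrium, and the allele frequencies (0.05 and 0.95) and the equal mixing proportion are hypothetical values chosen for illustration; they are not taken from Table 3.2, which is not reproduced in this passage. The function name is likewise an assumption.

```python
def admixture_disequilibrium(m, pop1, pop2):
    """D and D_max in a mixture of two subpopulations, each of which is
    in linkage equilibrium.

    m    : proportion of the mixture drawn from subpopulation 1
    pop1 : (frequency of A1, frequency of B1) in subpopulation 1
    pop2 : (frequency of A1, frequency of B1) in subpopulation 2
    """
    (pa1, pb1), (pa2, pb2) = pop1, pop2
    # Frequency of A1B1 gametes in the mixture (products within each
    # subpopulation, because each is in linkage equilibrium).
    p11 = m * pa1 * pb1 + (1 - m) * pa2 * pb2
    p_bar = m * pa1 + (1 - m) * pa2  # frequency of A1 in the mixture
    q_bar = m * pb1 + (1 - m) * pb2  # frequency of B1 in the mixture
    d = p11 - p_bar * q_bar
    d_max = min(p_bar * (1 - q_bar), (1 - p_bar) * q_bar)
    return d, d_max

# Hypothetical subpopulations with very different allele frequencies.
d, d_max = admixture_disequilibrium(0.5, (0.05, 0.05), (0.95, 0.95))
print(f"D = {d:.4f}, D_max = {d_max:.4f}, D/D_max = {d / d_max:.2f}")
```

With these particular values the mixture shows D at 81% of its maximum even though D is zero within each subpopulation. The comparison with D_max is appropriate here because D is positive; a negative D would be compared with D_min instead.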


In any population, the genotype frequencies among zygotes are determined
in large part by the patterns in which genotypes of the previous generation
come together to form mating pairs. In random mating, genotypes form mating
pairs in the proportions expected from random collisions. For a gene with
two alleles A and a in a random-mating population, the expected genotype
frequencies of AA, Aa, and aa are given by $p^2$, $2pq$, and $q^2$, respectively,
where p and q are the allele frequencies of A and a, respectively, with p + q = 1.
The expected genotype frequencies with random mating constitute the
Hardy-Weinberg equilibrium (HWE). The rate at which the HWE frequencies
are attained depends on the life history of the organism. In an organism with
nonoverlapping generations, such as an annual plant, each generation is
separated in time from the preceding and the following generation; in this case,
the Hardy-Weinberg frequencies are attained in one generation of random
mating provided that the allele frequencies are equal in the sexes. In an
organism with overlapping generations, the approach to HWE is gradual.
Statistical tests of HWE are often based on the $\chi^2$ test, but this test is
relatively weak in detecting departures from the expected frequencies, especially
those caused by admixture of subpopulations differing in allele frequency.

One of the principal implications of the HWE is that the allele frequencies 
and the genotype frequencies remain constant from generation to generation, 
hence genetic variation is maintained. Another major implication is that, 
when an allele is rare, the population contains many more heterozygotes for
the allele than it contains homozygotes for the allele. 

Extensions of the HWE include multiple alleles and X-linked genes. With
multiple alleles, the expected frequency of a homozygous genotype $A_iA_i$
equals $p_i^2$, and the expected frequency of a heterozygous genotype $A_iA_j$ equals
$2p_ip_j$, where $p_i$ and $p_j$ are the allele frequencies of $A_i$ and $A_j$. With X-linked
alleles, the genotype frequencies in females (XX) are given by the HWE but those
in males (XY) are given by the allele frequencies. Consequently, for a recessive
X-linked mutation with allele frequency q, the proportion of affected males ($q$)
always exceeds the proportion of affected females ($q^2$); the rarer the recessive
allele, the greater is the excess of affected males.

Nonrandom association between the alleles of different genes is measured 
by the linkage disequilibrium parameter D. Random association between 
alleles of different genes is called linkage equilibrium, and it is indicated by 
D = 0. When D ≠ 0, the alleles are said to be in linkage disequilibrium.
Ordinarily, unless there is some countervailing process that maintains linkage
disequilibrium between two genes, D is expected to go to zero at a rate deter-
mined by the recombination fraction between the genes. For unlinked genes, 
D decreases by one-half in each generation; for genes that recombine with a 
frequency r, D decreases by the fraction r in each generation. Significant link- 
age disequilibrium is usually found in natural populations for genes that are 
tightly linked, for genes that are within or near an inverted segment of chro- 
mosome, or for genes in plant species that regularly undergo self-fertilization. 
Significant linkage disequilibrium can also result from admixture of two or 
more subpopulations differing in allele frequencies. 


1. Phenylketonuria is an autosomal recessive form of severe mental
retardation. About one in 10,000 newborn Caucasians are affected. Assuming
random mating, what is the frequency of heterozygous carriers?

2. Mourant et al. (1976) cite data on 400 Basques from Spain, of which 230
were Rh+ and 170 were Rh-. Estimate the allele frequencies of D and d.
How many of the Rh+ individuals are expected to be heterozygous?

3. Kelus (cited in Mourant et al. 1976) reports a study of 3100 Poles, of
whom 1101 were MM, 1496 were MN, and 503 were NN. Calculate the
allele frequencies and the expected numbers of the three genotypes and
carry out a $\chi^2$ test for goodness of fit to random-mating proportions.

4. Consider an autosomal gene with four alleles $A_1$, $A_2$, $A_3$, and $A_4$ with
respective frequencies 0.1, 0.2, 0.3, and 0.4. Calculate the expected
genotype frequencies under random mating.

5. Show that the proportion of heterozygous offspring from a heterozygous
parent is 1/2 in a population undergoing random mating for a single gene
with two alleles.

6. If random mating with two alleles gives frequencies D, H, and R for
homozygous dominant, heterozygote, and homozygous recessive, show
that $DR = \tfrac{1}{4}H^2$.

7. When mating is random for a gene with two alleles A and a at
frequencies p and q, show that the genotype frequencies of AA, Aa, and aa are




the population structure of a widespread species of freshwater fish. The
lowest population level consists of a local interbreeding population of animals
within a stream. A stream may contain more than one such local population. 
The next-higher level in the hierarchy may be the organization of streams 
into groups feeding the same river. Another higher level may be rivers with- 
in watersheds. An even higher level of organization may be watersheds 
within continents. The aggregation of subpopulations into progressively 
more inclusive groups may continue for as many levels as is convenient and 
informative. It is inevitably somewhat arbitrary how the groups at each level 
are combined to form the next higher level in the hierarchy. The objective of 
the classification is informativeness: one tries to group the subpopulations in 
such a way as to highlight the genetic similarities and differences among 
them. If there were so much migration of fish among subpopulations that all 
members of the species constituted essentially a single, random-mating pop- 
ulation, then there would be no need to define a hierarchical population 
structure because it would be uninformative. However, most organisms do 
have significant population substructure. 

Reduction in Heterozygosity 

One of the important consequences of population substructure is a reduction
in the average proportion of heterozygous genotypes relative to that expected
under random mating. The reason for the reduction in heterozygosity
may be understood by considering the hypothetical example in Figure 4.1.
The outline is the floor plan of a large barn. The organisms of interest are the
mice concentrated primarily into two subpopulations of equal size at the
west and east ends of the barn. The movement of mice between the
subpopulations is prevented by a large population of hungry and vigilant cats in the
central area. The occasional mouse that comes out of its refuge is quickly
eaten. (These hypothetical mice have not been endowed with the ingenuity
to find alternative routes between the west and east ends of the barn, like
sneaking along the rafters.) Because of chance effects in the founding of the
subpopulations, the west and east subpopulations are completely homozygous
for alternative alleles of a gene. All the mice in the west subpopulation
are AA, and all those in the east subpopulation are aa. In technical terms, the
west subpopulation is fixed for the A allele (its allele frequency equals 1),
and the east subpopulation is fixed for the a allele. The genotype frequencies
of AA, Aa, and aa in the west subpopulation are 1, 0, and 0, respectively, and
those in the east subpopulation are 0, 0, and 1, respectively. Within each
subpopulation there is random mating, and the genotype frequencies, though
extreme, still satisfy the Hardy-Weinberg principle. In particular, the
frequencies of AA, Aa, and aa within each subpopulation are given by $p^2$,
$2pq$, and $q^2$, where p = 0 in the east subpopulation, and p = 1 in the west
subpopulation. Therefore, within any one of the subpopulations in Figure 4.1,
the frequency of heterozygotes equals the frequency expected with HWE.

Figure 4.1 An extreme example of the general principle that a difference in
allele frequency among subpopulations results in a deficiency of heterozygotes.
The floor plan is that of a hypothetical barn. The mouse subpopulations in the
east and west enclaves are completely isolated owing to the cats in the middle.
The west subpopulation is fixed for the A allele and the east subpopulation for
the a allele. Trapping mice at random in the area patrolled by the cats would
yield an overall allele frequency of 1/2 but no heterozygotes.

The situation regarding the total population in Figure 4.1 is very different,
however, as there is an overall deficiency of heterozygotes. By "total
population" in this context, we mean the aggregate of all mice without regard to the
population substructure. Suppose we were unaware of the population
substructure in the barn. We might then suppose that the barn contained a single
randomly mating population. To study the total population of the barn, we
trap mice at random in the center area, catching the occasional escapee from
the cats. Because the subpopulations are fixed for either A or a, half the time
we would trap an AA homozygote and half the time an aa homozygote.
Consequently, we estimate the allele frequency of A as p = 1/2. Assuming random
mating and Hardy-Weinberg genotype frequencies in the total population,
the expected genotype frequencies of AA, Aa, and aa are given by the HWE
as $p^2$, $2pq$, and $q^2$. Because the overall allele frequency of A among the
trapped animals is 1/2, we would naively expect a fraction 2 × 1/2 × 1/2 = 1/2 of
the animals to be heterozygous. In fact, we would have caught no
heterozygotes at all!


Population Substructure 

Hierarchical Structure F Statistics, Wahlund Effect 

DNA Typing Assortative Mating Inbreeding 

Inbreeding Coefficient 

Population substructure is almost universal among organisms.
Many organisms naturally form subpopulations in the form of 
herds, flocks, schools, colonies, or other types of aggregations. In 
addition, natural habitats are typically patchy, with favorable areas inter- 
mixed with unfavorable areas. Through time, even uniformly favorable areas 
can be disrupted by floods, fires, or other perils. When there is population 
subdivision, there is almost inevitably some genetic differentiation among 
the subpopulations. By genetic differentiation we mean the acquisition of
allele frequencies that differ among the subpopulations. Genetic differentia- 
tion may result from natural selection favoring different genotypes in differ- 
ent subpopulations, but it may also result from random processes in the 
transmission of alleles from one generation to the next or from chance differ- 
ences in allele frequency among the initial founders of the subpopulations. 
This chapter considers some of the consequences of population subdivision 
as well as other types of nonrandom mating. 


A population is said to have a hierarchical population structure if the sub- 
populations can be grouped into progressively inclusive levels in which, at 
each grouping, the next lower levels are included ("nested") within the next 
higher ones. To consider a concrete example, imagine we were interested in 


104 Chapter 3 

quently, $\chi^2 = 0.1631^2 \times 489 = 13.0$ with one degree of freedom, for
which the associated probability is 0.0004. Thus, there is significant
linkage disequilibrium between these genes. The value of $D_{max}$ is the
smaller of 0.053 and 0.102, and so $D_{max} = 0.053$. The magnitude of the
linkage disequilibrium, relative to its theoretical maximum, is
0.012/0.053 = 22.6%. The $\chi^2$ can also be calculated from the expected
numbers of the four gametic types, which are 410.1, 25.9, 49.9, and
3.1, respectively.

PROBLEM 3.16 Use the formula for $\chi^2$ in Problem 3.15 to evaluate
the statistical significance of the linkage disequilibrium between
alleles of the gene for alcohol dehydrogenase in Drosophila melanogaster
and the presence or absence of an EcoRI restriction site located 3500
nucleotides downstream. The data are from a population descended
from animals trapped at a Dutch fruit market in Groningen (Cross
and Birley 1986).

Adh-F, EcoRI site present: 22
Adh-F, EcoRI site absent:   3
Adh-S, EcoRI site present:  4
Adh-S, EcoRI site absent:   5

ANSWER D = 0.085 and $\chi^2 = \rho^2 N = 0.453^2 \times 34 = 7.0$ with one degree
of freedom; the associated probability value is approximately 0.01.
The linkage disequilibrium is statistically significant and has a value
of 49% of its maximum possible value.
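
The arithmetic in this answer can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `ld_from_counts` is hypothetical, and it assumes that the statistic of Problem 3.15 is $\chi^2 = \rho^2 N$ with $\rho = D/\sqrt{p_A q_A p_B q_B}$, and that $D_{max}$ for positive D is the smaller of $p_A q_B$ and $q_A p_B$. Both assumptions are consistent with the numbers quoted in the answer.

```python
from math import sqrt

def ld_from_counts(n11, n12, n21, n22):
    """Return (D, rho, chi2, |D|/Dmax) from counts of the four gametic types.

    n11..n22 are counts of gametes A1B1, A1B2, A2B1, A2B2.
    Assumes chi2 = rho^2 * N (an assumption about Problem 3.15).
    """
    n = n11 + n12 + n21 + n22
    p11, p12, p21, p22 = (c / n for c in (n11, n12, n21, n22))
    d = p11 * p22 - p12 * p21          # linkage disequilibrium
    p_a, p_b = p11 + p12, p11 + p21    # allele frequencies of A1 and B1
    q_a, q_b = 1 - p_a, 1 - p_b
    rho = d / sqrt(p_a * q_a * p_b * q_b)
    chi2 = rho ** 2 * n
    if d >= 0:
        d_max = min(p_a * q_b, q_a * p_b)
    else:
        d_max = min(p_a * p_b, q_a * q_b)
    return d, rho, chi2, abs(d) / d_max

# Adh-F/Adh-S versus EcoRI site present/absent, counts from Problem 3.16
d, rho, chi2, fraction_of_max = ld_from_counts(22, 3, 4, 5)
print(round(d, 3), round(rho, 3), round(chi2, 1), round(fraction_of_max, 2))
```

With the counts 22, 3, 4, and 5, the script gives values that round to D = 0.085, ρ = 0.453, χ² = 7.0, and 49% of the maximum.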

Linkage disequilibrium in local populations, such as seen in the preceding
examples, can be caused by linkage disequilibrium in the founding population
that has not yet had time to dissipate due to the small value of r.
Another possible cause of linkage disequilibrium is admixture of populations
with differing gametic frequencies. A third possibility is natural selection
differentially favoring some genotypes over others to such an extent that
it overcomes the natural tendency for D to go to zero.

Several examples in which linkage disequilibrium typically is present in 
natural populations should be mentioned here. One case concerns plants that 
ordinarily undergo self-fertilization, and examples are discussed in Chapter
4 in connection with the discussion of inbreeding. Another case involves
certain inversions that are polymorphic in populations of certain species of
Drosophila, most notably D. pseudoobscura and D. subobscura and their
relatives. A chromosome with an inversion, as the name implies, has a certain
segment of its genes in reverse of the normal order. Because of the inverted 
segment the process of chromosome breakage and reunion in meiosis cannot 
be completed in the normal manner, with the result that the alleles in the 
inverted segment are usually unaffected by recombination and so they 
remain linked together. Because inversions prevent recombination, each 
inversion represents a sort of "supergene," and natural selection accumulates 
beneficially interacting alleles within each inversion. The beneficially inter- 
acting alleles are said to show genetic coadaptation. 

Linkage disequilibrium can also arise as an artifact of admixture of sub- 
populations that differ in allele frequencies. Organisms that are subdivided 
into local populations are said to have population substructure. An example 
of linkage disequilibrium arising from subpopulation admixture is illustrated
in Table 3.2. In this example, subpopulation 1 and subpopulation 2 are
both in linkage equilibrium for the alleles of the A and B genes.
Subpopulation 1 has an allele frequency of 0.05 for both $A_1$ and $B_1$, and
subpopulation 2 has an allele frequency of 0.95 for both $A_1$ and $B_1$. An
equal mixture of organisms from both subpopulations has the gametic
frequencies shown in the last column of Table 3.2. The allele frequencies of
$A_1$ and $B_1$ are both 0.50 in the

TABLE 3.2

                                     Subpopulation 1   Subpopulation 2   Equal mixture
$A_1B_1$
$A_1B_2$
$A_2B_1$
$A_2B_2$
$D = P_{11}P_{22} - P_{12}P_{21}$
$D_{max}$
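
The admixture example can be checked numerically. The sketch below is illustrative: it recomputes gametic frequencies from the allele frequencies stated in the text (0.05 and 0.95 for both $A_1$ and $B_1$), rather than reading them from Table 3.2, and the function names `gamete_freqs` and `disequilibrium` are hypothetical.

```python
def gamete_freqs(p_a1, p_b1):
    """Frequencies of A1B1, A1B2, A2B1, A2B2 under linkage equilibrium."""
    return (p_a1 * p_b1,
            p_a1 * (1 - p_b1),
            (1 - p_a1) * p_b1,
            (1 - p_a1) * (1 - p_b1))

def disequilibrium(freqs):
    """D = P11*P22 - P12*P21 for a tuple of four gametic frequencies."""
    p11, p12, p21, p22 = freqs
    return p11 * p22 - p12 * p21

sub1 = gamete_freqs(0.05, 0.05)   # subpopulation 1, in linkage equilibrium
sub2 = gamete_freqs(0.95, 0.95)   # subpopulation 2, in linkage equilibrium
# Equal mixture: average the gametic frequencies of the two subpopulations
mix = tuple((a + b) / 2 for a, b in zip(sub1, sub2))

d_sub1 = disequilibrium(sub1)
d_sub2 = disequilibrium(sub2)
d_mix = disequilibrium(mix)
print(d_sub1, d_sub2, d_mix)
```

Each subpopulation has D equal to zero (up to floating-point error), whereas the mixture has D of about 0.2025 even though no selection or linkage is involved. The disequilibrium is purely a consequence of mixing subpopulations with different allele frequencies.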





114 Chapter 4 

This rather paradoxical result — that there is a deficiency of heterozygotes 
in the total population even though random mating takes place within each 
subpopulation — is a consequence of the difference in allele frequency among 
the subpopulations. Were the allele frequencies in both subpopulations the 
same, it would not matter whether we sampled from the west subpopulation, 
the east subpopulation, or from the area in between. We would recover
genotypes in Hardy-Weinberg proportions because both subpopulations are
genotypically identical and in HWE. In an organism with hierarchically
structured subpopulations, there is an analogous deficiency of heterozygotes 
at each level in the hierarchy. The following section examines the heterozy- 
gosities in more detail. 

Average Heterozygosity 

In the Mohave desert, local populations of the annual plant Linanthus parryae
are polymorphic for white versus blue flowers. The plant is diminutive,
averaging just 1 cm in height, and when the plant is in bloom, the ground cover
of white flowers justifies the popular name "desert snow." Blue flowers 
result from homozygosity for a recessive allele. The geographical distribu- 
tion of the frequency q of the recessive allele across a region of the Mohave 
desert is illustrated in Figure 4.2. Each allele frequency is based on an exam- 
ination of approximately 4000 plants over an area of about 30 square miles 
(Epling and Dobzhansky 1942). 

Figure 4.2 Estimated frequency of a recessive allele for blue flower color in
populations of Linanthus parryae in an area of approximately 900 square miles in
the Mohave desert. Each allele frequency is based on an examination of
approximately 4000 plants over an area of about 30 square miles. (After Wright 1943a.)

Judging from the allele-frequency map in Figure 4.2, the highest frequencies
of the blue-flower allele are largely concentrated at the west and east
ends of the region in question. The unequal allele frequencies across the
range imply a decrease in average heterozygosity, relative to HWE, analogous
to the mouse example in Figure 4.1, though not as extreme. Figure 4.2
shows the estimated allele frequency in each of 30 subpopulations. Suppose
each of the subpopulations is regarded as a random-mating unit in HWE for
the flower-color alleles. The average heterozygosity among the subpopulations
can be denoted as $H_S$, where the subscript indicates subpopulation. The
calculations are shown in the third column in Table 4.1; the heterozygosity in
each subpopulation is calculated as 2pq, where p and q are the estimated
frequencies of the alleles for white versus blue flower color, respectively, in
each subpopulation. The $H_S$ tabulated at the bottom is the average of all the

TABLE 4.1

Level        Average allele frequency (q)    Heterozygosity (2pq)
Region W     0.5153                          0.4995
Region C     0.0138                          0.0272
Region E     0.1888                          0.3062
Total        0.1374                          0.2371

$H_S$ = 0.1424     $H_R$ = 0.1589     $H_T$ = 0.2371

Source: Data from Wright 1943a.


subpopulation heterozygosities (counting the value 0.000 a total of nine times
because of the nine different subpopulations in which q = 0.000).

A second hierarchical level of population substructure is that of region — 
west (W), central (C), or east (E). To calculate the heterozygosity expected 
from HWE in each region, we first estimate the average allele frequency in 
the region by taking the mean allele frequency across all subpopulations in 
the region. For example, the average allele frequency $\bar{q}$ in region E is
(0.106 + 0.224 + 0.411 + 0.014)/4 = 0.1888. In each region, the heterozygosity
expected from HWE is calculated as 2pq, where p and q are the average allele
frequencies in the region. In region E, therefore, the regional heterozygosity
equals 2 × (1 - 0.1888) × 0.1888 = 0.3062. The average heterozygosity within
regions at the bottom of column 5 is denoted $H_R$; it is the weighted average
of the regional heterozygosities, where each regional heterozygosity is
weighted by the number of subpopulations in the region. In this example,
$H_R$ = (6 × 0.4995 + 20 × 0.0272 + 4 × 0.3062)/30 = 0.1589.

Yet another hierarchical level of population substructure in Figure 4.2 is 
the total population — the aggregate population obtained by conceptually 
uniting all subpopulations to form a single random mating unit. The average 
allele frequency is the mean allele frequency across all subpopulations, and
q = 0.1374. Then $H_T$ is calculated as 2pq = 2 × 0.8626 × 0.1374 = 0.2371.
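
The heterozygosity calculations for Linanthus can be retraced in code. The following is a sketch under stated assumptions: it uses only the numbers quoted in the text (the four allele frequencies of region E, the regional heterozygosities with their subpopulation counts of 6, 20, and 4, and the overall mean q = 0.1374), and the helper name `hwe_heterozygosity` is hypothetical.

```python
def hwe_heterozygosity(q):
    """Expected heterozygosity 2pq under HWE for recessive-allele frequency q."""
    return 2 * (1 - q) * q

# Region E: mean allele frequency across its four subpopulations
east_q = [0.106, 0.224, 0.411, 0.014]
q_east = sum(east_q) / len(east_q)
h_east = hwe_heterozygosity(q_east)

# H_R: regional heterozygosities weighted by number of subpopulations
regions = {"W": (6, 0.4995), "C": (20, 0.0272), "E": (4, 0.3062)}
n_subpops = sum(n for n, _ in regions.values())
h_r = sum(n * h for n, h in regions.values()) / n_subpops

# H_T: heterozygosity from the mean allele frequency over all subpopulations
h_t = hwe_heterozygosity(0.1374)

print(round(q_east, 4), round(h_east, 4), round(h_r, 4), round(h_t, 4))
```

Small differences in the last digit relative to the text can arise because the inputs here are already rounded to four decimal places.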

To sum up: 

• $H_S$ is the average HWE heterozygosity among organisms within
random-mating subpopulations.

• $H_R$ is the average HWE heterozygosity among organisms within regions.

• $H_T$ is the average HWE heterozygosity among organisms within the total
population.

The concepts of hierarchical population structure and the various levels of 
heterozygosity were originally developed by Sewall Wright (1889-1988) to 
quantify genetic differences among subgroups at the various levels; he called 
his theory isolation by distance (Wright 1943a, 1943b). The motivation for 
developing such a method was summarized in the following passage. The 
term panmixia is a synonym for random mating. 

Study of statistical differences among local populations is an important line of 
attack on the evolutionary problem. While such differences can only rarely
represent first steps toward speciation in the sense of the splitting of the 
species, they are important for the evolution of the species as a whole. They 
provide a possible basis for intergroup selection of genetic systems, a process 
that provides a more effective mechanism for adaptive advance of the species 
as a whole than does the mass selection which is all that can occur under
panmixia.

Furthermore, the reduction in heterozygosity resulting from population 
substructure is intimately related to the reduction in heterozygosity caused
by inbreeding (mating between relatives), as we shall see later in this
chapter. Indeed, the relation of population substructure to inbreeding can be
understood by interpreting each subpopulation as a sort of "extended family"
or set of interconnected pedigrees. Organisms in the same subpopulation
will often share one or more recent or remote common ancestors, and so a
mating between organisms in the same subpopulation will often be a mating
between relatives. The larger the subpopulation, and the more recently it has
been isolated, the smaller this inbreeding effect; nevertheless the analogy to 
inbreeding is valid. 

Wright's F Statistics 

To quantify the inbreeding effect of population substructure, Wright (1921)
defined what has come to be called the fixation index. This index equals the
reduction in heterozygosity expected with random mating at any one level
of a population hierarchy relative to another, more inclusive level of the
hierarchy. The fixation index is a useful index of genetic differentiation
because it allows an objective comparison of the overall effect of population
substructure among different organisms without getting into details of allele
frequencies, observed levels of heterozygosity, and so forth. The genetic
symbol for a fixation index is F embellished with subscripts denoting the
levels of the hierarchy being compared. For example, $F_{SR}$ is the fixation
index of the subpopulations relative to the regional aggregates:

$$F_{SR} = \frac{H_R - H_S}{H_R} \tag{4.1}$$


In words, Equation 4.1 defines $F_{SR}$ as the decrease of heterozygosity
among subpopulations within regions ($H_R - H_S$), relative to the
heterozygosity among regions ($H_R$). For the Linanthus example in Table 4.1,
$F_{SR}$ = (0.1589 - 0.1424)/0.1589 = 0.1036.

At the next level of the hierarchy, we may define the fixation index $F_{RT}$ as
the proportionate reduction in heterozygosity of the regional aggregates
relative to the total combined population:

$$F_{RT} = \frac{H_T - H_R}{H_T} \tag{4.2}$$


The data in Table 4.1 show that $F_{RT}$ = (0.2371 - 0.1589)/0.2371 = 0.3299.
Comparison of this value with $F_{SR}$ above already makes it clear that there is
substantially more variation among regions (as measured by $F_{RT}$) than there is
among subpopulations within regions (as measured by $F_{SR}$). The comparison
of the fixation indices at the two levels gives quantitative expression to the
regional differences apparent in Figure 4.2.


The fixation index $F_{ST}$ compares the least inclusive to the most inclusive
levels of the population hierarchy and measures all effects of population
substructure combined:

$$F_{ST} = \frac{H_T - H_S}{H_T} \tag{4.3}$$


From Table 4.1, $F_{ST}$ = (0.2371 - 0.1424)/0.2371 = 0.3993. The overall
reduction in average heterozygosity is therefore close to 40% of the total
heterozygosity, a very substantial effect.

The hierarchical F statistics defined in Equations 4.1 through 4.3 are all
types of fixation indices, but they differ in the reference populations:
$F_{SR}$ is concerned with subpopulations (S) relative to the regional
aggregates (R), $F_{RT}$ is concerned with the regional groupings relative to
the total population (T), and $F_{ST}$ is concerned with the subpopulations
relative to the total population. The index $F_{ST}$ is the most inclusive
measure of population substructure. The mathematical relation between the
three types of F statistics is demonstrated in the following problem.

PROBLEM 4.1 Show that $F_{SR}$, $F_{RT}$, and $F_{ST}$ are related by the equation

$$(1 - F_{SR})(1 - F_{RT}) = 1 - F_{ST}$$

ANSWER From Equation 4.1, $F_{SR} = 1 - (H_S/H_R)$, or $1 - F_{SR} = H_S/H_R$.
Equation 4.2 implies that $F_{RT} = 1 - (H_R/H_T)$, or $1 - F_{RT} = H_R/H_T$.
Finally, Equation 4.3 implies that $F_{ST} = 1 - (H_S/H_T)$, or
$1 - F_{ST} = H_S/H_T$. Now multiply the expressions for $1 - F_{SR}$ and
$1 - F_{RT}$ together to obtain
$(1 - F_{SR}) \times (1 - F_{RT}) = (H_S/H_R) \times (H_R/H_T) = H_S/H_T = 1 - F_{ST}$.
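
The three fixation indices and the identity of Problem 4.1 can be verified numerically with the Linanthus heterozygosities from Table 4.1. This is a minimal sketch; the function name `fixation_index` is hypothetical, and because the inputs are rounded to four decimals, the results may differ from the quoted values in the fourth decimal place.

```python
def fixation_index(h_lower, h_upper):
    """Proportionate reduction in heterozygosity of a less inclusive level
    (h_lower) relative to a more inclusive level (h_upper)."""
    return (h_upper - h_lower) / h_upper

# Heterozygosities for Linanthus as given in Table 4.1
h_s, h_r, h_t = 0.1424, 0.1589, 0.2371

f_sr = fixation_index(h_s, h_r)   # Equation 4.1
f_rt = fixation_index(h_r, h_t)   # Equation 4.2
f_st = fixation_index(h_s, h_t)   # Equation 4.3

# Identity from Problem 4.1: (1 - F_SR)(1 - F_RT) = 1 - F_ST
lhs = (1 - f_sr) * (1 - f_rt)
rhs = 1 - f_st
print(round(f_sr, 3), round(f_rt, 3), round(f_st, 3), abs(lhs - rhs) < 1e-12)
```

The identity holds exactly (up to floating-point error) for any positive heterozygosities, because the intermediate term $H_R$ cancels in the product.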

For examining the overall level of genetic divergence among subpopulations,
$F_{ST}$ is the informative statistic. Although $F_{ST}$ has a theoretical
minimum of 0 (indicating no genetic divergence) and a theoretical maximum of 1
(indicating fixation for alternative alleles in different subpopulations), the
observed maximum is usually much less than 1. Wright (1978) has suggested
the following qualitative guidelines for the interpretation of $F_{ST}$:

• The range 0 to 0.05 may be considered as indicating little genetic
differentiation.

• The range 0.05 to 0.15 indicates moderate genetic differentiation.

• The range 0.15 to 0.25 indicates great genetic differentiation.

• Values of $F_{ST}$ above 0.25 indicate very great genetic differentiation.

On the other hand, Wright also notes that, among subpopulations,
"differentiation is by no means negligible if $F_{ST}$ is as small as 0.05 or even less."
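
Wright's qualitative guidelines amount to a simple lookup. The sketch below is illustrative only: the function name `wright_category` is hypothetical, and assigning the boundary values 0.05, 0.15, and 0.25 to the lower category is an assumption, because the guidelines do not say which category a boundary value belongs to.

```python
def wright_category(f_st):
    """Classify F_ST using Wright's (1978) qualitative guidelines.

    Boundary values are assigned to the lower category (an assumption).
    """
    if not 0.0 <= f_st <= 1.0:
        raise ValueError("F_ST must lie between 0 and 1")
    if f_st <= 0.05:
        return "little"
    if f_st <= 0.15:
        return "moderate"
    if f_st <= 0.25:
        return "great"
    return "very great"

# Linanthus values computed earlier in the chapter
print(wright_category(0.1036))  # F_SR, subpopulations within regions
print(wright_category(0.3993))  # F_ST, subpopulations within the total
```

As Wright's own caveat indicates, the labels are only a rough guide, and a value in the lowest range does not imply that differentiation is negligible.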

PROBLEM 4.2 Some subpopulations of Drosophila melanogaster
show an altitudinal gradient in the allozymes of alcohol dehydroge- 
nase in which the frequency of the Adh-F allele increases with alti- 
tude. The data in the accompanying table are estimates of the allele 
frequency of Adh-F in seven samples of adult flies captured either in 
the mountains, in the foothills, or on the plains of the Caucasus 
Mountains of the former Soviet Union. Each allele frequency is based 
on electrophoresis of approximately 300 adult flies (Grossman et al. 
1970). Calculate the F statistics $F_{SE}$ (subpopulations within elevations),
$F_{ET}$ (elevations within the total), and $F_{ST}$ (subpopulations relative to
the total). What do the magnitudes of the F statistics suggest regarding
genetic differentiation among subpopulations in the frequency of
Adh-F with respect to altitude? 

Elevation   Allele frequency   Elevation   Allele frequency   Elevation   Allele frequency













ANSWER Let p represent the allele frequency of Adh-F. For each 
subpopulation, the HWE heterozygosity equals 2p(l - p), which for 
the seven samples are 0.4359 and 0.3498 (mountain), 0.2277 and 0.1942 
(foothill), and 0.1506, 0.1605, and 0.0676 (plain). The average of these 
values is $H_S$, which equals 0.2266. At each of the elevations, the average
allele frequency is the mean across the subpopulations sampled at
that elevation. For mountain, foothill, and plain, these means equal
0.274, 0.120, and 0.068, respectively, yielding the elevation HWE
heterozygosities 0.3974, 0.2112, and 0.1273, respectively. (Your results
may differ slightly according to the number of significant digits you
carry along.) The average of the elevation heterozygosities equals the
mean elevation heterozygosity ($H_E$), and it is the weighted average
(2 × 0.3974 + 2 × 0.2112 + 3 × 0.1273)/7 = 0.2285. Finally, the allele
frequency for the total heterozygosity is equal to the mean allele
frequency across subpopulations, which is 0.142, yielding a total HWE
heterozygosity ($H_T$) of 0.2433. The F statistics are
$F_{SE} = (H_E - H_S)/H_E = 0.0081$, $F_{ET} = (H_T - H_E)/H_T = 0.0609$, and
$F_{ST} = (H_T - H_S)/H_T = 0.0684$. [As a check, note that
$(1 - F_{SE}) \times (1 - F_{ET}) = 1 - F_{ST}$.] Judging from the
magnitudes of the F statistics, it is clear that most of the differentiation
among subpopulations is correlated with altitude; there is very
little genetic differentiation among subpopulations at each elevation.

The method of estimating the F statistics by replacing the parameters in 
Equations 4.1 through 4.3 with their observed or estimated values is not nec- 
essarily the best, particularly with small samples. Ideally, estimates of the F 
statistics should correct for the effects of sampling a limited number of sub- 
populations, as well as for the effects of sampling a limited number of organ- 
isms in each subpopulation. Methods for making these corrections have been 
suggested but are quite complex and raise additional issues. For an excellent 
discussion, see Weir and Cockerham (1984). Important issues are also 
addressed in Wright (1978, pp. 86-89), Curie-Cohen (1982), Nei and Chesser 
(1983), and Nei (1986). We will use the uncorrected estimation procedure, 
which is adequate for purposes of illustration. 

Genetic Divergence among Subpopulations 

The fixation index $F_{ST}$ defined in Equation 4.3 serves as a convenient and
widely used measure of genetic differences among subpopulations. The
identification of the causes underlying a particular value of $F_{ST}$ observed in
a natural population is often difficult. Allele frequencies among
subpopulations can become different because of random processes (random genetic
drift) as well as by natural selection with complications from migration
among the subpopulations. Difficulties in the assignment of cause do not,
however, invalidate the usefulness of $F_{ST}$ as an index of genetic
differentiation.
The levels of genetic divergence among human subpopulations and
among subpopulations of several other species are presented in Table 4.2. The
values of $F_{ST}$ imply that genetic divergence between human subpopulations
is quite small. Of the total genetic variation found in three major races
(Caucasoid, Negroid, and Mongoloid), only 7% (0.07) is ascribable to genetic



TABLE 4.2

Organism                                 Number of populations   Number of loci
Human (major races)
Human, Yanomama Indian villages
House mouse (Mus musculus)
Jumping rodent (Dipodomys ordii)
Horseshoe crab (Limulus)
Lycopod plant (Lycopodium lucidulum)

Source: Protein electrophoretic data from Nei 1975.

differences among races. About 93% of the total genetic variation is found
within races. Similarly, of the total genetic variation found in the native
Yanomama Indians of Venezuela and Brazil, only 7.7% (0.077) is due to
differences in allele frequency among villages. This result implies that 92.3% of
the total genetic variation is found within any single village. Values of $F_{ST}$
for other organisms are quite variable, presumably because $F_{ST}$ is influenced
by the size of the subpopulations (which is a major determinant of the
magnitude of random changes in allele frequency), by the amount and pattern of
migration between subpopulations, and by other factors, including natural
selection.
Table 4.2 provokes a brief discussion of the sensitive term race because the 
term is prone to misunderstanding or misuse. In population genetics, a race 
is a group of organisms in a species that are genetically more similar to each 
other than they are to the members of other such groups. Populations that 
have undergone some degree of genetic divergence as measured by, for
example, $F_{ST}$, therefore qualify as races. Using this definition, the human
population contains many races. Each Yanomama village represents, in a
certain sense, a separate "race," and the Yanomama as a whole also form a
distinct "race." Such fine distinctions are rarely useful, however. It is usually
more convenient to group populations into larger units that still qualify as 
races in the definition given. These larger units often coincide with races 


based on physical characteristics such as skin color, hair color, hair texture, 
facial features, and body conformation. Contemporary anthropologists tend 
to avoid "race" as a descriptive term for human groups because cultural and 
linguistic differences, which are also important, are often discordant with 
genetic differences and sometimes discordant with each other. 

Here it must be pointed out that the data in Table 4.2, which indicate
much more genetic variation within than among human races, may be
misleading. The conclusion is based primarily on genes determining allozymes,
and it certainly is not true for genes influencing skin color, hair color, hair tex- 
ture, and other traits that most people think of in connection with the word 
"race." However, skin color and other prominent racial characteristics are 
used to delineate races precisely because racial differences for these traits are 
rather large, so the genes involved cannot be representative of the entire 
genome. On the other hand, allozyme loci may not be very representative of 
the genome either. See Nei and Roychoudhury (1982) for a review of the 
genetic relationship and evolution of human races. 

In human population genetics, the Wahlund principle is usually cited for
its implication that fusion of subpopulations results in a decrease in the
average frequency of children born with a genetic disease resulting from
homozygosity for a rare recessive allele, particularly an allele with a
relatively high frequency in one of the subpopulations. Examples of harmful
recessive alleles at high frequency in some human subpopulations include: in
Caucasians, the alleles for $\alpha_1$-antitrypsin deficiency (q = 0.024) and cystic
fibrosis (q = 0.022); in blacks, sickle-cell anemia (q = 0.05 in American blacks,
up to q = 0.1 in some African populations); in the Hopi and some other
Southwest American Indian tribes, albinism (q = 0.07); and, in Ashkenazi Jews,
Tay-Sachs disease (q = 0.013).

The Wahlund principle for a recessive allele in two subpopulations is
illustrated in Figure 4.3A. The west subpopulation has allele frequency $q_1$ and
genotype frequency $q_1^2$; the east subpopulation has allele frequency $q_2$ and
genotype frequency $q_2^2$. The average frequency of the homozygous recessive


The flip side of the coin of heterozygosity is homozygosity because a diploid 
organism that is not heterozygous must be homozygous. Mathematically, 
homozygosity = 1 - heterozygosity. Therefore, a corollary of the deficit in
average heterozygosity, relative to HWE, that results from population
substructure is that there is an equal excess in average homozygosity. If the
population substructure is eliminated and the former subpopulations undergo
random mating, the average homozygosity decreases, and the average
heterozygosity increases by an equal amount. The phenomenon that the
average homozygosity decreases when subpopulations join together is called
isolate breaking or the Wahlund principle, after the Swedish statistician
and human geneticist Sten Gösta William Wahlund (1901-1976) who first
described the effect (Wahlund 1928).

The subpopulations of hypothetical mice in Figure 4.1 afford an illus- 
tration of the Wahlund principle. As long as the cats keep the subpopulations 
separate, the homozygosity equals 1 because the west subpopulation is geno- 
typically AA and the east subpopulation is genotypically aa. If the cats were 
to disappear and the subpopulations of mice came together and practiced 
random mating, the genotype frequencies would be 1/4 AA, 1/2 Aa, and 1/4 aa.
The homozygosity in the fused population is 1/4 + 1/4 = 1/2, which is a
substantial decrease from the average in the subpopulations prior to fusion and
random mating. Not only is the total homozygosity reduced by population
fusion, so is the average frequency of each homozygous genotype. Consider
aa, for example. Prior to fusion, the average frequency of aa across both
subpopulations equals 1/2; after fusion and random mating, the frequency of aa
equals 1/4.

(A) Separate subpopulations: average $R_{\text{separate}} = (q_1^2 + q_2^2)/2$

(B) Fused subpopulations: $\bar{q} = (q_1 + q_2)/2$

Figure 4.3 Illustration of the Wahlund principle. The frequency of homozygous
recessives after population fusion and random mating is less than the average
frequency before fusion. The difference in frequency of the homozygous
recessives equals the variance in allele frequency among the subpopulations.



across both subpopulations equals $(q_1^2 + q_2^2)/2$. The result of fusion of the
subpopulations is shown in part B. Assuming that the subpopulations are equal
in size, the allele frequency in the combined population is
$\bar{q} = (q_1 + q_2)/2$, and the genotype frequency with HWE equals $\bar{q}^2$.
Therefore, were the subpopulations in part A to fuse and come into HWE, the
average frequency of homozygous recessives would be reduced by an amount
given by:

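$$
\begin{aligned}
R_{\text{separate}} - R_{\text{fused}} &= \frac{q_1^2 + q_2^2}{2} - \bar{q}^2 \\
&= \frac{(q_1 - \bar{q})^2 + (q_2 - \bar{q})^2}{2} \\
&= \sigma_q^2
\end{aligned}
\tag{4.4}
$$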

In Equation 4.4, we leave it as an exercise to verify that the expressions in
$q_1$ and $q_2$ on the first and second lines are equal. The symbol $\sigma_q^2$ is the
variance in allele frequency among the original subpopulations. Because the
variance is always nonnegative, isolate breaking always decreases the average
frequency of homozygous recessives unless the allele frequencies are equal to
begin with. Furthermore, the result in Equation 4.4 is true for any number of
subpopulations of equal or unequal size; in words:

Fusion of subpopulations with random mating and HWE decreases the aver- 
age frequency of homozygous recessives by an amount equal to the variance 
in allele frequency among the original subpopulations. 

To illustrate the effect of isolate breaking, imagine a subpopulation of
gray squirrels that has a high frequency of albinism equal to 16%. (Albinism
is an inherited absence of pigment resulting from a homozygous recessive
gene.) In a nearby forest there is another subpopulation of equal size in
which the albino mutation is absent, so that the allele frequency in this
subpopulation is 0. Overall, the average frequency of albinos in the two
populations is (0.16 + 0)/2 = 8%. With HWE, the allele frequency in the first
subpopulation is $\sqrt{0.16} = 0.4$. If the two subpopulations fused with random
mating and HWE, the allele frequency of the albino mutation in the fused
population would be (0.4 + 0)/2 = 0.2, and the frequency of the homozygous
recessive would equal $0.2^2$ = 4%. The frequency of albinos in the fused
population is substantially smaller than the average frequency in the original
subpopulations.
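
The squirrel example is a direct application of Equation 4.4, and the equality between the reduction in homozygous recessives and the variance in allele frequency can be confirmed in code. This is a sketch only: the function name `wahlund` and its optional `weights` argument (relative subpopulation sizes, equal by default) are hypothetical.

```python
from math import sqrt

def wahlund(qs, weights=None):
    """Return (R_separate, R_fused, variance) for recessive-allele
    frequencies qs in subpopulations with the given relative sizes."""
    if weights is None:
        weights = [1 / len(qs)] * len(qs)   # equal-sized subpopulations
    q_bar = sum(w * q for w, q in zip(weights, qs))
    r_separate = sum(w * q * q for w, q in zip(weights, qs))  # average q^2
    r_fused = q_bar ** 2                                      # HWE after fusion
    variance = sum(w * (q - q_bar) ** 2 for w, q in zip(weights, qs))
    return r_separate, r_fused, variance

# Squirrels: 16% albinos in one subpopulation (q = 0.4), none in the other
r_sep, r_fused, var_q = wahlund([sqrt(0.16), 0.0])
print(round(r_sep, 4), round(r_fused, 4), round(var_q, 4))
```

The average albino frequency drops from 8% to 4%, and the difference of 4% equals the variance in allele frequency between the two subpopulations.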

PROBLEM 4.3 Tay-Sachs disease is an autosomal-recessive degenerative
disorder of the brain that usually leads to death in infancy or
early childhood. Among Ashkenazi Jews, the incidence of the condition
is about 1 in 6000 births but, in other groups, the incidence is
about 1 in 500,000 births (Myrianthopoulos and Aronson 1966). What
incidence of the disease would be expected among the offspring of
matings of Ashkenazi Jews with members of other groups? If these
offspring were to mate randomly among themselves, what incidence
of the disease would be expected in future generations?

ANSWER The allele frequency of the Tay-Sachs mutation among
Ashkenazi Jews is estimated as $q_1 = \sqrt{1/6000} = 1.291 \times 10^{-2}$;
in other groups, $q_2 = \sqrt{1/500{,}000} = 1.414 \times 10^{-3}$. In matings between
members of the two groups, the expected frequency of homozygous
recessives is $q_1 q_2 = 1.826 \times 10^{-5}$, or about 1 in 55,000 births. There is
actually a greater reduction in the first generation than in subsequent
generations because each mating in the first generation combines a
high-risk gamete with a low-risk gamete. The allele frequency in the
first-generation offspring is $(q_1 + q_2)/2 = 7.162 \times 10^{-3}$ and, with HWE in
subsequent generations, the homozygous recessive frequency stabilizes at
$(7.162 \times 10^{-3})^2 = 5.130 \times 10^{-5}$, or about 1 in 19,000 births. (The fact that
homozygous recessives do not reproduce has been ignored because the
effect is negligible.)
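
The numbers in this answer follow from a few square roots and products, as the sketch below shows. The variable names are illustrative, and the calculation assumes HWE within each group and equal contributions of the two groups to the offspring, as in the answer.

```python
from math import sqrt

# Allele frequencies estimated from the incidence of homozygous recessives
q1 = sqrt(1 / 6000)       # Ashkenazi Jews
q2 = sqrt(1 / 500_000)    # other groups

# First generation: each offspring receives one gamete from each group
first_generation = q1 * q2

# Later generations: random mating among offspring with allele frequency q_bar
q_bar = (q1 + q2) / 2
later_generations = q_bar ** 2

print(f"q1 = {q1:.4e}, q2 = {q2:.4e}")
print(f"first generation: {first_generation:.3e}")
print(f"later generations: {later_generations:.3e}")
```

The first-generation frequency is the product of the two allele frequencies, which is smaller than the square of their mean; this is why the incidence rises somewhat after the first generation before stabilizing.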

Wahlund's Principle and the Fixation Index 

Equation 4.4 applies equally well to AA homozygotes as to aa homozygotes.
Therefore, letting P represent the frequency of homozygous AA genotypes,
we can write

$$P_{\text{separate}} - P_{\text{fused}} = \sigma_p^2 \tag{4.5}$$

When there are only two alleles, the total reduction in homozygosity must
be the summation of Equations 4.4 and 4.5, which equals $\sigma_q^2 + \sigma_p^2$.
Because there are only two alleles, it is also true that $\sigma_p^2 = \sigma_q^2$,
which we will write as $\sigma^2$. Hence, the total reduction in homozygosity from
the Wahlund effect upon population fusion and HWE can be expressed as follows:

Reduction in total homozygosity = $2\sigma^2$

On the other hand, the reduction in total homozygosity with population
fusion must also equal the increase in heterozygosity, the term $H_T - H_S$
in Equation 4.3, which is the numerator of $F_{ST}$. Hence,
$F_{ST} = (H_T - H_S)/H_T = 2\sigma^2/H_T$. However, $H_T$ is the heterozygosity
with HWE using the average allele frequencies, $\bar{p}$ and $\bar{q}$, across
subpopulations, so that $H_T = 2\bar{p}\bar{q}$. Therefore, the connection between
the fixation index $F_{ST}$ and the variance in allele frequency is

$$F_{ST} = \frac{\sigma^2}{\bar{p}\,\bar{q}} \tag{4.6}$$



Consequently, the F statistics at the various levels of a hierarchical
population are related to the variances in allele frequencies among the
subpopulations grouped together at the various levels. Equation 4.6 affords
a convenient method of estimating $F_{ST}$ from allele-frequency data. For
example, among the subpopulations of Linanthus in Figure 4.2, the variance
in allele frequency is 0.0473. Earlier we calculated the average allele
frequencies as $\bar{p} = 0.8626$ and $\bar{q} = 0.1374$. Hence,
$\sigma^2/(\bar{p} \times \bar{q}) = 0.3993$, which confirms the previous
calculation that $F_{ST} = 0.3993$. (The values as stated may differ slightly
from yours because they were calculated with more than four significant digits.)
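
As a minimal sketch, Equation 4.6 can be written as a one-line Python function. The function name is hypothetical; the inputs are the rounded Linanthus values quoted above, so the result agrees with 0.3993 only to about three decimal places.

```python
def fst_from_variance(variance: float, p_bar: float) -> float:
    """Equation 4.6: F_ST = sigma^2 / (p_bar * q_bar)."""
    q_bar = 1.0 - p_bar
    return variance / (p_bar * q_bar)


# Rounded Linanthus values quoted in the text.
fst_linanthus = fst_from_variance(variance=0.0473, p_bar=0.8626)
print(f"F_ST = {fst_linanthus:.4f}")
```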

PROBLEM 4.4 The data in the accompanying table are the allele
frequencies of several genes in three human subpopulations: (A) blacks
from West Africa; (B) blacks from Claxton, Georgia; and (C) whites
from Claxton, Georgia (Adams and Ward 1973). Each gene has two
predominant alleles and may, for purposes of this problem, be
considered to have only two alleles. The genes control the MN blood
group (alleles M and N), the Ss blood group (alleles S and s), the
Duffy blood group (alleles $Fy^a$ and $Fy^b$), the Kidd blood group (alleles
$Jk^a$ and $Jk^b$), the Kell blood group (alleles K and k), the enzyme
glucose-6-phosphate dehydrogenase (alleles $G6PD^-$ and $G6PD^+$), and
$\beta$-hemoglobin (alleles $\beta^S$ and $\beta^+$). For each gene, use Equation 4.6 to
estimate $F_{ST}$ for the comparison A versus B and for the comparison A
versus C. Classify each $F_{ST}$ as indicating little, moderate, great, or very
great genetic differentiation according to Wright's qualitative
guidelines. Note: In comparing two subpopulations with two alleles in
each, the variance in allele frequency is $\sigma^2 = (p_1 - p_2)^2/4$.




Blacks (West Africa)    Blacks (Georgia)    Whites (Georgia)










ANSWER The estimates and their qualitative interpretations are as 
shown in the table. It is clear that the degree of genetic divergence 
between West African blacks and Georgia blacks, as assessed by the 
average $F_{ST}$ value, is relatively small. However, some of the genes
show substantial genetic divergence between blacks and whites. 
Note, however, that the fixation index can differ substantially from 
one gene to another. This compilation of genes includes gradations of 
genetic divergence ranging from little to very great. 


Gene             A versus B          A versus C
MN               0.0001 (little)     0.0011 (little)
Ss               0.0004 (little)     0.0164 (little)
Duffy            0.0230 (little)     0.2676 (very great)
Kidd             0.0031 (little)     0.0260 (little)
Kell             0.0001 (little)     0.0591 (moderate)
G6PD             0.0067 (little)     0.0965 (moderate)
β-hemoglobin     0.0089 (little)     0.0471 (little)
Average          0.0060 (little)     0.0734 (moderate)
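
The two-subpopulation calculation in Problem 4.4 can be sketched in Python as follows. The function names and the example allele frequencies are hypothetical, and the cut-offs used for the qualitative classes (0.05, 0.15, and 0.25) are an assumption based on Wright's guidelines as they are commonly quoted; they are not restated in this passage.

```python
def pairwise_fst(p1: float, p2: float) -> float:
    """F_ST for two subpopulations and two alleles, using Equation 4.6."""
    p_bar = (p1 + p2) / 2
    q_bar = 1 - p_bar
    variance = (p1 - p2) ** 2 / 4  # sigma^2 for two subpopulations
    if p_bar * q_bar == 0:
        return 0.0  # both fixed for the same allele: no differentiation
    return variance / (p_bar * q_bar)


def classify(fst: float) -> str:
    """Qualitative class; the thresholds are assumed, see the lead-in."""
    if fst < 0.05:
        return "little"
    if fst < 0.15:
        return "moderate"
    if fst < 0.25:
        return "great"
    return "very great"


# Hypothetical allele frequencies in two subpopulations.
example = pairwise_fst(0.1, 0.3)
print(f"F_ST = {example:.4f} ({classify(example)})")
```

With these assumed cut-offs, the classifications agree with the labels given in the answer table, for example 0.0471 (little) and 0.2676 (very great).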

Genotype Frequencies in Subdivided Populations 

In many organisms in which the population structure is hierarchical, it is
useful to be able to calculate directly the average genotype frequencies across
all subpopulations. Equations 4.4 through 4.6 make it possible to deduce the
average genotype frequencies. Consider first Equation 4.5, which pertains to
the genotype frequency of AA. The quantity called $P_{\text{separate}}$ is what
we wish to calculate: it is the average frequency of AA across subpopulations.
The quantity $P_{\text{fused}}$ equals $\bar{p}^2$, the genotype frequency of AA
with population fusion and HWE. The value of $\sigma^2$ is also known from
Equation 4.6: it equals $F_{ST} \times \bar{p} \times \bar{q}$. Putting all this
together, the average genotype frequency of AA across subpopulations must
equal $\bar{p}^2 + F_{ST}\bar{p}\bar{q}$. Likewise, interpreting Equation 4.4 in
the same manner as Equation 4.5 yields the average genotype frequency of aa
across subpopulations as $\bar{q}^2 + F_{ST}\bar{p}\bar{q}$.

Because every genotype that is not homozygous must be heterozygous,
the average genotype frequency of heterozygotes across subpopulations is
given by $1 - (\bar{p}^2 + F_{ST}\bar{p}\bar{q}) - (\bar{q}^2 + F_{ST}\bar{p}\bar{q})$.
Note that $1 - \bar{p}^2 - \bar{q}^2 = 2\bar{p}\bar{q}$, and so the average
frequency of heterozygotes simplifies to $2\bar{p}\bar{q} - 2\bar{p}\bar{q}F_{ST}$.

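The genotype frequencies in a subdivided population are important
enough to be displayed.

$$\begin{aligned}
AA &: \ \bar{p}^2 + \bar{p}\bar{q}F_{ST} \\
Aa &: \ 2\bar{p}\bar{q} - 2\bar{p}\bar{q}F_{ST} \\
aa &: \ \bar{q}^2 + \bar{p}\bar{q}F_{ST}
\end{aligned} \qquad (4.7)$$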

These genotype frequencies are the average genotype frequencies across
all subpopulations. They do not obey the Hardy-Weinberg principle because
there is an excess of homozygotes and a deficiency of heterozygotes relative
to HWE. The result is somewhat paradoxical because, within any particular
subpopulation, the genotype frequencies do obey the Hardy-Weinberg
principle with whatever allele frequencies are found in that subpopulation.
The reason for the validity of HWE within each subpopulation is the
assumption of random mating within each subpopulation. The reason for the
departure from HWE in the population as a whole is that the subpopulations
differ in allele frequency. Because the allele frequencies differ, random
mating within each subpopulation is not equivalent to random mating among
all the organisms in the entire population.

From the expressions in Equation 4.7, it is clear that the value of $F_{ST}$
determines the degree of departure from HWE. If $F_{ST} = 0$, the second term
in each expression vanishes, and the genotype frequencies reduce to the HWE;
on the other hand, $F_{ST} = 0$ means that there is no variation in allele
frequency among the subpopulations for the gene in question. Because $F_{ST}$
may vary from one gene to the next, other genes in the same subpopulations
may have nonzero values of $F_{ST}$. The extreme case is $F_{ST} = 1$, which
happens when two subpopulations are fixed for alternative alleles. In this
case, the average allele frequencies are 1/2 for each allele and the average
genotype frequencies of AA, Aa, and aa across subpopulations are 1/2, 0, and
1/2, respectively. This case is illustrated in Figure 4.1.
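
The relation between Equation 4.7 and HWE within subpopulations can be checked numerically. The sketch below assumes equal-sized subpopulations with hypothetical allele frequencies, averages the HWE genotype frequencies directly, and compares the result with Equation 4.7; the function names are illustrative only.

```python
def subdivided_genotype_freqs(p_bar: float, fst: float) -> dict:
    """Average genotype frequencies across subpopulations (Equation 4.7)."""
    q_bar = 1 - p_bar
    return {
        "AA": p_bar ** 2 + p_bar * q_bar * fst,
        "Aa": 2 * p_bar * q_bar * (1 - fst),
        "aa": q_bar ** 2 + p_bar * q_bar * fst,
    }


def averaged_hwe(freqs: list) -> dict:
    """Average of HWE genotype frequencies over equal-sized subpopulations."""
    n = len(freqs)
    return {
        "AA": sum(p * p for p in freqs) / n,
        "Aa": sum(2 * p * (1 - p) for p in freqs) / n,
        "aa": sum((1 - p) ** 2 for p in freqs) / n,
    }


# Hypothetical allele frequencies of A in two subpopulations.
subpop_p = [0.2, 0.6]
p_bar = sum(subpop_p) / len(subpop_p)
variance = sum((p - p_bar) ** 2 for p in subpop_p) / len(subpop_p)
fst = variance / (p_bar * (1 - p_bar))  # Equation 4.6

direct = averaged_hwe(subpop_p)
from_equation = subdivided_genotype_freqs(p_bar, fst)
print(direct)
print(from_equation)

# Extreme case: two subpopulations fixed for alternative alleles.
print(subdivided_genotype_freqs(0.5, 1.0))
```

Both routes give the same averages, which illustrates that the apparent departure from HWE arises purely from pooling subpopulations that differ in allele frequency.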


The term DNA typing means the application of molecular genetics to high- 
ly polymorphic genetic markers for the purpose of matching DNA samples 
from unknown people with those of known suspects. Applications include 
paternity testing, in which DNA from a child is matched against that of an 
accused father, and criminal investigation, in which a crime-scene sample of 
DNA from blood, semen, or other sources is matched against that of one or 
more suspects. DNA typing undoubtedly ranks with the use of fingerprints 
as a major innovation in personal identification. 

In theory, DNA typing is not as powerful as ordinary fingerprinting.
Fingerprints result from the pattern of raised skin ridges that carry sweat
glands. The ridge pattern on each finger may form an arch, loop, whorl, or
other design. The ridges vary in pattern from one person to the next so
greatly that each person has unique fingerprints suitable for personal
identification. When the fingers are formed in the embryo, the fingertips
develop as fluid-filled pads. The fluid is later resorbed and the expanded
skin collapses, forming the ridges. There is a strong random component to
the manner in which the skin collapses, and so the details of the fingerprint
pattern differ in each finger and in each person. Even identical twins have
different fingerprints. However, certain general features of the fingerprints
are strongly inherited, for example, the total number of ridges on all the
fingers, without regard to pattern. The Dionne quintuplets (five Canadian
girls born in 1934, all formed from the splitting of a single fertilized egg)
had total ridge counts ranging between 99 and 102; by comparison, their
older siblings had total ridge counts of 69, 78, and 139.

It is the random component in fingerprint ridge pattern that makes
fingerprints so powerful for personal identification. DNA types are inherited
and so are not necessarily unique in each person. Even for a highly
polymorphic marker in which both parents are heterozygous (for example, in
the mating $A_iA_j \times A_kA_l$), any particular genotype in an offspring has
a 1/4 chance of being matched in a sibling owing to Mendelian segregation.
Thus, strong evidence that an unknown DNA sample comes from a particular
suspect can come only from the matching of a combination of genotypes across
a number of polymorphic loci. The strength of the evidence increases with the
number of loci that are examined and the number of alleles present in the
population. The greater the number of loci, and the more highly polymorphic
the loci, the stronger the evidence linking the suspect to the unknown sample.
Although matching DNA types may provide strong evidence that a suspect is the
source of an unknown sample, a DNA mismatch is usually conclusive. When
the DNA of a suspect contains alleles that are clearly not present in the
unknown sample, then the sample must have originated from a different person.

Polymorphisms Based on a Variable Number of Tandem Repeats 

The type of polymorphism usually used in DNA typing in the United States
is illustrated in Figure 4.4. Each allele of a locus is defined by the size of a
restriction fragment that hybridizes with a locus-specific probe in a Southern
blot (Chapter 2). The restriction fragments differ in size according to the
number of copies they contain of a short sequence of nucleotides repeated in
tandem. When there are more copies of the repeating unit, the restriction
fragment is of greater size. A polymorphic gene of this type is called a VNTR
polymorphism, which means that the restriction fragments contain a variable
number of tandem repeats. VNTRs are employed in DNA typing because the
variable number of repeating units makes many alleles possible. Although
many alleles may be present in the population as a whole, any one person can
have no more than two alleles of each VNTR locus.

Figure 4.4 Allelic variation resulting from a variable number of units repeated
in tandem in a nonessential region of a gene. The probe DNA detects a
restriction fragment for each allele. The length of the fragment depends on
the number of repeating units present. (From Hartl 1994.)

An example of a VNTR used in DNA typing is shown in Figure 4.5.
The lanes in the gel labeled M contain multiple DNA fragments of known
size to serve as molecular-weight markers. Each numbered lane contains
DNA from a different person. Two typical features of VNTRs are to be noted:

• Most people are heterozygous for two VNTR alleles with restriction frag- 
ments of different size. Heterozygosity is indicated by the presence of 
two distinct bands. In Figure 4.5, only the persons numbered 2 and 5 
appear to be homozygous for a particular allele. 

• The restriction fragments from different people cover a wide range of
sizes. The variability in size indicates that the population as a whole
contains many VNTR alleles.

Figure 4.5 also makes it clear why VNTR polymorphisms are useful in
DNA typing: each of the 13 people has a different DNA type (pattern of
bands) for this VNTR and therefore could be distinguished from any other
person. On the other hand, the uniqueness of each DNA type in Figure 4.5
results in part from the small sample size. If more people were examined,
then DNA types that matched by chance might well be found among unrelated
people. For example, in one study of five VNTR loci, the chance of a
match between unrelated people ranged from 1/20 to 1/200, depending on
the locus (Herrin 1993). Although less common, chance matches for two
VNTR loci can also be found among unrelated people. The same study found
two-locus matches at frequencies of 1/2,500 to 1/50,000. Even chance
matches for three VNTR loci are far from impossible. In one study of Italians
from Milan, three-locus matches were found at a frequency of approximately
1/1,200 (Krane et al. 1992). Because of the possibility of chance matches
between VNTR types, applications of DNA typing are usually based on at
least three loci and preferably more. Matches at 7 to 9 VNTR loci are
virtually definitive of identity, barring technical errors in the DNA typing
itself (such as mislabeling of blood samples) and except for identical twins.

Figure 4.5 Genetic variation in a VNTR used in DNA typing. Each numbered
lane contains DNA from a single person. After digestion of the DNA with a
restriction enzyme, the fragments are separated by electrophoresis and
hybridized with a radioactive probe DNA. The lanes labeled M contain
molecular-weight markers; lane C is another type of internal control.
(Courtesy of R. W.

DNA typing can be exclusionary as well as incriminating. For example, if
the DNA type of a suspected rapist does not match the DNA type of semen
taken from the victim, then the suspect could not be the perpetrator, unless
there is some reason to suspect that the test itself was faulty. For example,
Figure 4.6 shows the DNA profiles of nine VNTR loci among three suspects
and from evidence recovered in seven serial rape cases. The label M denotes
molecular-weight markers (present in four lanes in each panel), S1-S3
denotes three suspects in the cases, and U1-U7 denotes DNA from semen
samples recovered from the seven victims. Suspects S1 and S3 are excluded
by the DNA typing, but S2 matches at all nine loci. Based on this and other
evidence, a jury convicted suspect S2 of 81 criminal counts related to these
and other cases. He was sentenced to 139 years in prison and will not become
eligible for parole until the year 2087.

Match Probabilities with Hardy-Weinberg Equilibrium 
and Linkage Equilibrium 

If a person is found whose DNA type matches that of a sample found at the
scene of a crime, how is the significance of the match to be evaluated? The
significance of the match depends on the likelihood of it happening by
chance, and hence matches of rare DNA types are more telling than matches
of common DNA types. Initially, the method for estimating the frequency
of a DNA type in the population was to use a cross-multiplication square like
that in Figure 3.6, extended to multiple alleles, to calculate the expected
frequency of the particular genotype for each VNTR locus; this calculation
assumes Hardy-Weinberg equilibrium (HWE). The locus-by-locus frequencies
were then multiplied together to obtain the expected frequency of the
multilocus match; this calculation assumes linkage equilibrium. With HWE and
linkage equilibrium, the expected frequency of a DNA type in the population
as a whole is calculated as

$$\prod_{\text{homozygous loci}} p_i^2 \times \prod_{\text{heterozygous loci}} 2p_ip_j \qquad (4.8)$$

where capital $\Pi$ means chain multiplication. The first multiplication is
across all loci presumed to be homozygous owing to the presence of a single
band in the gel; for each locus, $p_i$ is the frequency of the allele that is
homozygous. The second multiplication is across all heterozygous loci and,
for each locus, the factor is two times the product of the frequencies of the
alleles that are heterozygous. Because human subpopulations can differ in
their allele frequencies, the calculation would be carried out using allele
frequencies among Caucasians for white suspects, using those among blacks
for black suspects, and using those among Hispanics for Hispanic suspects.
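
The product rule of Equation 4.8 can be sketched in a few lines of Python. The data layout (a one-element tuple for a single-band locus, a two-element tuple for a two-band locus) and the allele frequencies are hypothetical and chosen only for illustration; HWE and linkage equilibrium are assumed, as in the text.

```python
from math import prod


def locus_frequency(genotype: tuple) -> float:
    """HWE genotype frequency at one locus.

    A one-element tuple (p_i,) is a single-band (presumed homozygous) locus;
    a two-element tuple (p_i, p_j) is a heterozygous locus.
    """
    if len(genotype) == 1:
        (p_i,) = genotype
        return p_i ** 2
    p_i, p_j = genotype
    return 2 * p_i * p_j


def match_probability(profile: list) -> float:
    """Equation 4.8: multiply the per-locus frequencies across loci."""
    return prod(locus_frequency(g) for g in profile)


# Hypothetical three-locus profile: heterozygous, homozygous, heterozygous.
profile = [(0.05, 0.10), (0.08,), (0.02, 0.12)]
print(f"expected frequency = {match_probability(profile):.3e}")
```

Each additional locus multiplies the expected frequency by a factor well below 1, which is why multilocus matches become rare so quickly under these assumptions.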

Effects of Population Substructure 

The multiplication in Equation 4.8 makes a number of assumptions about
human populations: (1) that the Hardy-Weinberg principle holds for each
locus, (2) that each locus is statistically independent of the others so that the
multiplication across loci is justified, and (3) that the only level of
population substructure that is important for DNA typing is that of race.
Critics of the multiplication rule argued that genetically important
subpopulations need not coincide with racial designations. For example, the
term "Hispanic" includes a mixture of different subpopulations with variable
amounts of Spanish, native American Indian, and African ancestry.
Similarly, there are potentially important differences in allele frequency
among Caucasian populations (for example, Finnish people versus Italians)
and among black populations (for example, blacks from Africa versus
blacks from Trinidad). Furthermore, if the allele frequencies of different
VNTRs differ among subpopulations, then the loci are not statistically
independent, even if they are genetically unlinked, and so the multiplication
across loci is unjustified. Because of population substructure, DNA matches
across multiple VNTRs could be more common among people within a
particular ethnic group than among people drawn at random from the
population as a whole, and so calculations of genotype frequency should be
based on the ethnic group of the accused person and not on the race as a whole.

On the other side, defenders of the multiplication rule argued that
population substructure would have a relatively minor effect on the final
outcome of the calculation and that what matters most is not a high degree of
accuracy but rather a general sense of whether a particular multilocus
genotype is rare or common. After much acrimony in the scientific community
and in courts of law, a panel of the National Research Council (NRC 1992)
recommended a compromise called the ceiling principle in which a modified
multiplication procedure was adopted using, for each allele frequency, a
"ceiling" equal to the larger of either 0.10 or the upper 95% confidence limit
of the highest frequency of the allele observed among at least three racial
databases.

Figure 4.6 An example of DNA typing. Suspect S2 matches evidence samples in seven rape
cases (U1-U7) for each of nine VNTR loci (D1S7, D2S44, D4S139, and so forth). Suspects S1
and S3 do not match and are excluded. The lanes labeled M contain molecular-weight
markers. (Courtesy of Steven L. Redding, Office of the Hennepin County District Attorney,
Minneapolis, and Lowell C. Van Berkom and Carla J. Finis, Minnesota Bureau of Criminal

Even this recommendation proved controversial because some population
geneticists regarded the compromise formula as too conservative.
Continuing controversy prompted the formation of a second panel of the
National Research Council (NRC 1996), which recommended the use of a
modified product rule that takes moderate population substructure into
account. According to this recommendation, in most cases the match
probability may be calculated according to the left-hand side of the following:

$$\prod_{\text{homozygous loci}} 2p_i \times \prod_{\text{heterozygous loci}} 2p_ip_j \;>\; \prod_{\text{homozygous loci}} \left[p_i^2 + p_i(1 - p_i)F_{ST}\right] \times \prod_{\text{heterozygous loci}} \left[2p_ip_j - 2p_ip_jF_{ST}\right]$$

In this expression, $p_i$ and $p_j$ have the same meaning as in Equation 4.8,
and $F_{ST}$ is the fixation index among the subpopulations in the larger whole
(typically a major racial group). The use of the calculation is justified by the
inequality. Each factor on the right-hand side of this inequality is the
per-locus genotype frequency calculated from Equation 4.7, which takes
$F_{ST}$ into account. The left-hand side is greater than the right-hand side
because, for each homozygous locus, it can be shown that
$2p_i > p_i^2 + p_i(1 - p_i)F_{ST}$; and, for each heterozygous locus, it is clear
that $2p_ip_j > 2p_ip_j - 2p_ip_jF_{ST}$ because $F_{ST} > 0$.
Equally as important as the calculation itself, the committee emphasized, was
the principle that no probability value should be cited unless accompanied
by an appropriate 95% confidence interval to indicate its degree of reliability.
The 1996 report also enumerated a number of special situations in which 
alternative formulas are required because of population substructure or 


When matings take place between relatives, the pattern of mating is called 
inbreeding. In human beings, the closest degree of inbreeding usually 
encountered in most societies is first-cousin mating. Many plants regularly 
undergo self-fertilization, and some insects regularly practice brother-sister 
mating. Inbreeding need not unite close relatives, however. As we shall see, 
a certain level of inbreeding is inescapable in small subpopulations because
the members of a subpopulation typically share recent or remote common
ancestors. The common ancestry between mating pairs constitutes inbreeding.
Hence, the genetic differentiation among subpopulations described by
the hierarchical F statistics can be interpreted as a sort of inbreeding effect
resulting from population substructure. The relationship between population
substructure and inbreeding is a subtle one, but it has profound consequences
in population genetics.

Genotype Frequencies with Inbreeding 

The main effect of population substructure is a decrease in average het- 
erozygosity among subpopulations, relative to the heterozygosity expected 
with random mating in a hypothetical total population. Likewise, the main 
effect of inbreeding is to produce organisms with a decrease in heterozygos- 
ity, relative to the heterozygosity expected with random mating in the same 
subpopulation. The decrease in heterozygosity due to inbreeding can be 
illustrated with the example of repeated self-fertilization. Consider a
self-fertilizing population of plants that consists of 1/4 AA, 1/2 Aa, and
1/4 aa genotypes, which are in Hardy-Weinberg proportions. Because each
plant undergoes self-fertilization, the AA and aa genotypes produce only AA
and aa offspring, respectively, and the Aa genotypes produce 1/4 AA, 1/2 Aa,
and 1/4 aa offspring. After one generation of self-fertilization, therefore,
the genotype frequencies of AA, Aa, and aa are:

$$\begin{aligned}
AA &: \ \tfrac{1}{4} \times 1 + \tfrac{1}{2} \times \tfrac{1}{4} = \tfrac{3}{8} \\
Aa &: \ \tfrac{1}{2} \times \tfrac{1}{2} = \tfrac{1}{4} \\
aa &: \ \tfrac{1}{4} \times 1 + \tfrac{1}{2} \times \tfrac{1}{4} = \tfrac{3}{8}
\end{aligned}$$

These genotype frequencies are no longer in Hardy-Weinberg proportions.
There is a deficiency of heterozygous genotypes and an excess of
homozygous genotypes. After a second generation of self-fertilization, the
genotype frequencies are 7/16 AA, 2/16 Aa, and 7/16 aa, which have an even
greater deficiency of heterozygotes. Note, however, that the allele frequency
of A remains constant. Denoting the allele frequency of A as p, then:

In the initial population: $p = \tfrac{1}{4} + \tfrac{1}{2} \times \tfrac{1}{2} = \tfrac{1}{2}$
After one generation of selfing: $p = \tfrac{3}{8} + \tfrac{1}{2} \times \tfrac{1}{4} = \tfrac{1}{2}$
After two generations of selfing: $p = \tfrac{7}{16} + \tfrac{1}{2} \times \tfrac{2}{16} = \tfrac{1}{2}$
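
The selfing example can be reproduced with exact fractions. This is a minimal sketch; the function names are illustrative, and the only assumptions are complete self-fertilization and no selection, as in the text.

```python
from fractions import Fraction

HALF = Fraction(1, 2)
QUARTER = Fraction(1, 4)


def self_one_generation(AA, Aa, aa):
    """One generation of complete selfing: Aa plants yield 1/4 AA, 1/2 Aa, 1/4 aa."""
    return (AA + QUARTER * Aa, HALF * Aa, aa + QUARTER * Aa)


def allele_frequency_A(AA, Aa, aa):
    """Frequency of A: all alleles of AA plants plus half those of Aa plants."""
    return AA + HALF * Aa


genotypes = (Fraction(1, 4), Fraction(1, 2), Fraction(1, 4))
history = [genotypes]
for _ in range(2):
    genotypes = self_one_generation(*genotypes)
    history.append(genotypes)

for generation, (AA, Aa, aa) in enumerate(history):
    print(generation, AA, Aa, aa, "p =", allele_frequency_A(AA, Aa, aa))
```

Heterozygosity halves every generation while the allele frequency stays at 1/2, which is the pattern described in the passage.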

The example of self-fertilization illustrates the general principle that
inbreeding, by itself, does not change the allele frequency. One assumption
required for constant allele frequencies under inbreeding is that all
genotypes must have an equal likelihood of survival and reproduction, which
is to say that no natural selection takes place. If there is selection, then
the allele frequencies can change with inbreeding (or, for that matter, with
any mating system).
The effects of inbreeding can be made quantitative by comparing the
proportion of heterozygous genotypes among inbred organisms with the
proportion of heterozygous genotypes expected with random mating. To be
precise, consider a gene with two alleles, A and a, at respective frequencies p
and q (with p + q = 1). Suppose that the frequency of heterozygous genotypes
in a subpopulation of inbred organisms is some quantity $H_I$. Were the
subpopulation undergoing random mating, the HWE frequency of heterozygous
genotypes would be 2pq. However, for the sake of generality, we will denote
the random-mating heterozygosity by the symbol $H_0$. The effects of
inbreeding can be defined as the proportionate reduction in heterozygosity
relative to random mating. This value is expressed mathematically as
$(H_0 - H_I)/H_0$; this ratio is usually denoted by the symbol F, which is
called the inbreeding coefficient. At this point, the use of F for the
inbreeding coefficient may seem a poor choice in view of the use of $F_{ST}$
and related symbols for measuring the effects of population substructure, but
we will see in a few moments that inbreeding and population substructure are
intimately related.

Thus we define

$$F = \frac{H_0 - H_I}{H_0} \qquad (4.9)$$

In biological terms, F measures the fractional reduction in heterozygosity
of an inbred subpopulation relative to a random-mating subpopulation with
the same allele frequencies. Because $H_0 = 2pq$, the frequency of heterozygous
genotypes in the inbred subpopulation can be written in terms of F as
$H_I = H_0 - H_0F = H_0(1 - F) = 2pq(1 - F)$.

The frequency of AA homozygous genotypes in an inbred subpopulation
can also be expressed in terms of F. Suppose that the proportion of AA
genotypes is denoted P. Because the allele frequency of A is p, we
must have, by Equation 4.9, that $P + H_I/2 = p$. But $H_I = 2pq(1 - F)$, and so
$P = p - 2pq(1 - F)/2$.

PROBLEM 4.5 Use the relation $P = p - 2pq(1 - F)/2$ and the fact that
p + q = 1 to show that $P = p^2 + pqF$. Show also that P can be written as
$P = p^2(1 - F) + pF$.

ANSWER $P = p - 2pq(1 - F)/2 = p - pq(1 - F) = p - pq + pqF = p(1 - q)
+ pqF = p^2 + pqF$. This establishes the first identity. Then, substituting
for q in the second term, $P = p^2 + p(1 - p)F = p^2 + pF - p^2F = p^2(1 - F) + pF$.

Problem 4.5 shows that the frequency of AA genotypes in an inbred
subpopulation equals $p^2(1 - F) + pF$. In a similar manner, it can be shown
that the frequency of aa genotypes is $q^2(1 - F) + qF$.

In summary, in a subpopulation of organisms with inbreeding coefficient 
F, the genotype frequencies are expected in the proportions: 

$$\begin{aligned}
AA &: \ p^2(1 - F) + pF = p^2 + pqF \\
Aa &: \ 2pq(1 - F) = 2pq - 2pqF \\
aa &: \ q^2(1 - F) + qF = q^2 + pqF
\end{aligned} \qquad (4.10)$$


The expressions at the far right in Equation 4.10 facilitate comparison of
the genotype frequencies expected with inbreeding relative to those expected
with HWE. With inbreeding, there is a deficiency of heterozygotes equal to
2pqF and an excess of each homozygous class equal to half the deficiency of
heterozygotes. The biological reason that the missing heterozygotes are
allocated equally to the two homozygous classes is that each heterozygous
genotype contains one A and one a allele. Notice that when there is no
inbreeding (F = 0), the genotype frequencies are in the familiar
Hardy-Weinberg proportions; with complete inbreeding (F = 1), the inbred
subpopulation consists entirely of AA and aa homozygotes in the frequencies
p and q, respectively.
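
The two forms of each expression in Equation 4.10, together with the limiting cases F = 0 and F = 1, can be checked with a short Python sketch. The function name and the example values of p and F are hypothetical.

```python
def inbred_genotype_freqs(p: float, F: float) -> dict:
    """Genotype frequencies with inbreeding coefficient F (Equation 4.10)."""
    q = 1 - p
    return {
        "AA": p * p * (1 - F) + p * F,
        "Aa": 2 * p * q * (1 - F),
        "aa": q * q * (1 - F) + q * F,
    }


p, F = 0.3, 0.25  # hypothetical allele frequency and inbreeding coefficient
q = 1 - p
freqs = inbred_genotype_freqs(p, F)

# Right-hand forms of Equation 4.10: HWE term plus or minus a multiple of pqF.
alternative = {
    "AA": p * p + p * q * F,
    "Aa": 2 * p * q - 2 * p * q * F,
    "aa": q * q + p * q * F,
}

print(freqs)
print(inbred_genotype_freqs(p, 0.0))  # Hardy-Weinberg proportions
print(inbred_genotype_freqs(p, 1.0))  # only homozygotes, at frequencies p and q
```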

If a gene has multiple alleles $A_1, A_2, \ldots, A_n$ at respective
frequencies $p_1, p_2, \ldots, p_n$ (with $p_1 + p_2 + \cdots + p_n = 1$), then in a
population with inbreeding coefficient F, the frequencies of $A_iA_i$
homozygotes and $A_iA_j$ heterozygotes are as follows:

$$\begin{aligned}
A_iA_i &: \ p_i^2(1 - F) + p_iF \\
A_iA_j &: \ 2p_ip_j(1 - F)
\end{aligned} \qquad (4.11)$$

We are now in a position to apply Equations 4.10 and 4.11 to real data.

PROBLEM 4.6 Plants able to undergo self-fertilization are said to
be self-compatible. In a population of self-compatible plants, if each
plant undergoes self-fertilization a fraction s of the time and
otherwise mates randomly, then it can be shown (Crow and Kimura
1970; Hedrick and Cockerham 1986) that F very quickly attains the
value F = s/(2 - s). Phlox cuspidata is self-compatible, and for this
species the amount of self-fertilization is estimated at approximately
s = 0.78 (Levin 1978). From s we can predict the inbreeding
coefficient as F = 0.78/(2 - 0.78) = 0.64. In a Texas population of P.
cuspidata, Levin (1978) found two electrophoretic alleles of the
phosphoglucomutase-2 gene, designated $Pgm\text{-}2^a$ and $Pgm\text{-}2^b$. In a
sample of 35 plants, there were 15 $Pgm\text{-}2^a/Pgm\text{-}2^a$,
6 $Pgm\text{-}2^a/Pgm\text{-}2^b$, and 14 $Pgm\text{-}2^b/Pgm\text{-}2^b$ genotypes.
Are these numbers consistent with the estimate F = 0.64? (Note: The
$\chi^2$ in this case has one degree of freedom because only the allele
frequency is estimated from the data; if F also were estimated from the
data, rather than being calculated independently from the degree of
self-fertilization, then there would be zero degrees of freedom and no
goodness-of-fit test would be possible.)

ANSWER The allele frequencies of $Pgm\text{-}2^a$ and $Pgm\text{-}2^b$ are estimated
as (30 + 6)/70 = 0.514 and 1 - 0.514 = 0.486, respectively. The hypothesis
is that F = 0.64, and so 1 - F = 0.36. The expected numbers of the
genotypes aa, ab, and bb are, respectively, $[(0.514)^2(0.36) + (0.514)(0.64)](35) = 14.8$,
$[2(0.514)(0.486)(0.36)](35) = 6.3$, and $[(0.486)^2(0.36) + (0.486)(0.64)](35) = 13.9$.
With these expectations, $\chi^2 = 0.02$ with one degree of freedom, and the
associated probability is about 0.90. The fit to the inbreeding model is excellent.
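
The goodness-of-fit calculation for the Phlox data can be sketched in Python using only the standard library. The variable names are illustrative; the genotype counts and F = 0.64 come from the problem, and the tail probability uses the identity that, for one degree of freedom, P(chi-square > x) = erfc(sqrt(x/2)).

```python
import math

# Observed genotype counts from the problem (a = Pgm-2a, b = Pgm-2b).
observed = {"aa": 15, "ab": 6, "bb": 14}
n = sum(observed.values())
F = 0.64  # predicted independently from the selfing rate s = 0.78

# Allele frequency of Pgm-2a estimated by gene counting.
p = (2 * observed["aa"] + observed["ab"]) / (2 * n)
q = 1 - p

# Expected counts from Equation 4.10.
expected = {
    "aa": (p * p * (1 - F) + p * F) * n,
    "ab": (2 * p * q * (1 - F)) * n,
    "bb": (q * q * (1 - F) + q * F) * n,
}

chi2 = sum((observed[g] - expected[g]) ** 2 / expected[g] for g in observed)

# Upper-tail probability for a chi-square with one degree of freedom.
p_value = math.erfc(math.sqrt(chi2 / 2))

print({g: round(v, 1) for g, v in expected.items()})
print(f"chi-square = {chi2:.3f}, P = {p_value:.2f}")
```

A large tail probability means that the observed counts are close to those expected under the inbreeding model.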

PROBLEM 4.7 Assuming that F = 0.64 in Texas populations of Phlox
cuspidata, calculate the genotype frequencies expected from the four
alleles of the gene Adh coding for alcohol dehydrogenase by using the
allele frequencies 0.11 (Adh-1), 0.84 (Adh-2), 0.01 (Adh-3), and 0.04
(Adh-4) from Problem 3.10 in Chapter 3.

ANSWER Using the expressions in Equation 4.11, the expected
frequencies are: Adh-1/Adh-1 = 0.0748, Adh-1/Adh-2 = 0.0665,
Adh-2/Adh-2 = 0.7916, Adh-1/Adh-3 = 0.0008, Adh-2/Adh-3 = 0.0060,
Adh-3/Adh-3 = 0.0064, Adh-1/Adh-4 = 0.0032, Adh-2/Adh-4 = 0.0242,
Adh-3/Adh-4 = 0.0003, Adh-4/Adh-4 = 0.0262.
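
The calculation in Problem 4.7 can be sketched as a direct application of Equation 4.11. The function name and the dictionary layout are illustrative; the allele frequencies and F = 0.64 are those stated in the problem.

```python
from itertools import combinations_with_replacement


def inbred_multiallele_freqs(allele_freqs: dict, F: float) -> dict:
    """Genotype frequencies for multiple alleles with inbreeding (Equation 4.11)."""
    result = {}
    for a, b in combinations_with_replacement(allele_freqs, 2):
        p_a, p_b = allele_freqs[a], allele_freqs[b]
        if a == b:
            result[(a, b)] = p_a ** 2 * (1 - F) + p_a * F  # homozygote
        else:
            result[(a, b)] = 2 * p_a * p_b * (1 - F)       # heterozygote
    return result


adh = {"Adh-1": 0.11, "Adh-2": 0.84, "Adh-3": 0.01, "Adh-4": 0.04}
freqs = inbred_multiallele_freqs(adh, F=0.64)

for (a, b), value in freqs.items():
    print(f"{a}/{b} = {value:.4f}")
print(f"total = {sum(freqs.values()):.4f}")
```

The ten genotype frequencies sum to 1, which is a quick check that no genotype class has been omitted.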

Relation Between the Inbreeding Coefficient 
and the F Statistics 

There is an intimate relation between the inbreeding coefficient F and the
hierarchical F statistics examined in the first section of this chapter. Each of
the hierarchical F statistics is also a type of inbreeding coefficient that
measures the reduction in heterozygosity at any level of a population
hierarchy, relative to a higher level. The connection between the inbreeding
coefficient and the F statistics is indicated by the formal similarity between
Equation 4.7 and the right-hand side of Equation 4.10. To incorporate the
inbreeding coefficient F from mating between relatives into the hierarchical
framework, we will embellish it with the subscript IS. In words, $F_{IS}$ is the
inbreeding coefficient of a group of inbred organisms relative to the
subpopulation to which they belong. The value of $F_{IS}$ is the reduction in
heterozygosity of the inbred organisms, and the genotype frequencies among
the inbred organisms are given by Equation 4.10 with p and q equal to the
allele frequencies in the relevant subpopulation. Within each subpopulation
there is random mating, and so the genotype frequencies are given by the HWE.
Among the subpopulations, however, there is a reduction in average
heterozygosity, relative to the total population, because mates within
subpopulations often share remote common ancestors. The sharing of remote
common ancestors explains the apparent paradox that inbreeding accumulates
even when there is random mating within a subpopulation. The reduction in
heterozygosity attributable to this type of inbreeding, relative to the total
population, is measured by $F_{ST}$, and the appropriate formulas for the
genotype frequencies, averaged across the subpopulations, are given in
Equation 4.7, in which $\bar{p}$ and $\bar{q}$ are the average allele frequencies
among the subpopulations.

A population geneticist is often interested not only in F_IS but also in F_IT.
The former measures the reduction in heterozygosity of a group of inbred
organisms relative to the subpopulation to which they belong; the latter measures
the reduction in heterozygosity of the inbred organisms relative to the total
population. Hence, F_IT is the most inclusive measure of all inbreeding. It
embraces not only the effects of mating between close relatives within a
subpopulation but also the accumulated inbreeding resulting from mating between
remote relatives at all levels of the population hierarchy. An expression for
F_IT is implicit in the definitions. For consistency, we will use the symbol H_S
to denote the heterozygosity in a particular subpopulation. Hence, Equation 4.9
defining F_IS may be rewritten as:

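\[ F_{IS} = \frac{H_S - H_I}{H_S} \qquad (4.12) \]

in which H_I is the heterozygosity of the inbred organisms. Similarly, if we use
H_T to denote the heterozygosity in the total population, the analogous equation
defining F_IT is:

\[ F_{IT} = \frac{H_T - H_I}{H_T} \qquad (4.13) \]

Consequently, 1 - F_IS = H_I/H_S and 1 - F_IT = H_I/H_T. However, the remarks
in Problem 4.1 also indicate that 1 - F_ST = H_S/H_T, and so by multiplication,

\[ (1 - F_{IS})(1 - F_{ST}) = 1 - F_{IT} \qquad (4.14) \]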


Hence, if we know both F_IS and F_ST, then we can obtain F_IT from Equation
4.14. The value of F_ST that results from mating between remote relatives in a
subpopulation of limited size is taken up in Chapter 7. The value of F_IS
resulting from mating between close relatives within a subpopulation can be
calculated from the pedigree of the inbred organisms by using an alternative
probability interpretation of F_IS defined in the next section.
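
As a rough numerical illustration of how the three hierarchical coefficients fit
together, the following Python sketch computes F_IS, F_ST, and F_IT from three
heterozygosities and checks the multiplicative relation among them. The function
name and the heterozygosity values are hypothetical, chosen only for
illustration; they are not data from any study.

```python
def f_statistics(h_i, h_s, h_t):
    """Return (F_IS, F_ST, F_IT) from heterozygosities at three levels.

    h_i: heterozygosity of the inbred organisms
    h_s: heterozygosity expected in the subpopulation
    h_t: heterozygosity expected in the total population
    """
    f_is = (h_s - h_i) / h_s
    f_st = (h_t - h_s) / h_t
    f_it = (h_t - h_i) / h_t
    return f_is, f_st, f_it


# Hypothetical heterozygosities, for illustration only.
f_is, f_st, f_it = f_statistics(h_i=0.30, h_s=0.40, h_t=0.50)

# The product of the two "escape" terms equals 1 - F_IT.
assert abs((1 - f_is) * (1 - f_st) - (1 - f_it)) < 1e-12
print(f_is, f_st, f_it)
```

With these assumed values, F_IT is larger than either F_IS or F_ST, as expected
of the most inclusive measure.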

The Inbreeding Coefficient as a Probability

The inbreeding coefficient F_IS (which we will again call simply F unless the
subscripts are needed for clarity) has an interpretation in terms of probability
in addition to its interpretation in terms of heterozygosity spelled out in
Equation 4.12. The probability interpretation is important in the calculation of
F from pedigrees. To express the inbreeding coefficient in terms of probability,
imagine the two alleles of a gene present in a single inbred organism. Because 
the organism is inbred, the parents share one or more common ancestors. The 
two alleles present in the inbred organism could have been derived from the 
same ancestral allele by DNA replication in one of the common ancestors. In 
this case, the alleles are said to be identical by descent (IBD), and the genotype 
of the inbred organism is said to be autozygous. Conversely, the alleles may not 
be replicas of a single ancestral allele, in which case the alleles are not identical 
by descent, and the genotype is said to be allozygous. The probability inter- 
pretation of the inbreeding coefficient is that F is the probability that the two 
alleles of a gene in an inbred organism are IBD (autozygous). Note that the
concepts of autozygosity and allozygosity have nothing to do with the state of an
allele — whether the allele is A or a, for example. The concepts are concerned 
only with common ancestry. If the alleles are replicas of a single allele in a 
common ancestor, they are autozygous; otherwise, they are allozygous. 

Interpreted as the probability of autozygosity, the inbreeding coefficient is 
clearly a relative concept. F measures the probability of autozygosity relative 
to some ancestral subpopulation. In defining the ancestral subpopulation, we 
arbitrarily assume that all alleles present in the ancestral population are not
identical by descent. The inbreeding coefficient of an organism in the present 
population is then the probability that the two alleles of a gene in the inbred 
organism arose by replication of a single allele more recently than the time at 
which the ancestral population existed. The ancestral population need not be 
remote in time from the present one. Indeed, the ancestral population,
usually presumed to be noninbred (F_IS = 0), typically refers to the population
existing just a few generations previous to the present one, and F_IS in the
present population then measures inbreeding that has accumulated in the
span of these few generations. (Technically, any prior inbreeding is allocated
to F_ST.) Because the span of time is usually short, the possibility of mutation
can safely be ignored. Autozygous genotypes must therefore be homozygous
for some allele of the gene under consideration. On the other hand, allozy- 
gous genotypes can be either homozygous or heterozygous. 

[Figure 4.7 diagram: alleles in the ancestral population, all presumed to be not
identical by descent, give rise to genotypes in the present population that are
autozygous and homozygous, allozygous and homozygous, or allozygous and
heterozygous.]

Figure 4.7 In a genotype that is autozygous, homologous alleles are derived
from a single DNA sequence in an ancestor, and they are therefore identical by
descent. In an allozygous genotype, homologous alleles are not identical by
descent. As shown here, allozygous genotypes may be heterozygous or
homozygous, but autozygous genotypes must be homozygous (except in the
unlikely event that one allele has mutated).

Figure 4.7 illustrates how the concepts of autozygosity and allozygosity are 
related to those of homozygosity and heterozygosity. The essential point is that 
two alleles can be identical by state (IBS), which means that they have the 
same sequence of nucleotides along the DNA, without being identical by 
descent. The concept of identity by descent pertains to the ancestral origin of 
an allele and not to its chemical makeup. Although, as shown in Figure 4.7, two 
distinct alleles that are identical by state (for example, two A_1 alleles or two A_2
alleles) may come together in fertilization and thereby make the inbred organ- 
ism homozygous, the alleles in the ancestral population are, by definition, not 
identical by descent, and so the genotype is allozygous. Similarly, although a 
heterozygous genotype must be allozygous (ignoring mutation), a homozy- 
gous genotype may be either autozygous or allozygous (see Figure 4.7). 

The probability interpretation of the inbreeding coefficient results in the
same expected genotype frequencies as the heterozygosity interpretation set
out in Equation 4.10. To verify the equivalence, we need only consider the
implications of the probability definition for a subpopulation of inbred
organisms. For this purpose, imagine a subpopulation in which the organisms
have average inbreeding coefficient F. Consider the alleles of a gene present
in any one of the inbred organisms. Either of two things must be true:
the alleles must either be allozygous (probability 1 - F) or be autozygous
(probability F). If the alleles are allozygous, then the probability that the
chosen organism has any particular genotype is simply the probability of that
genotype in a random-mating population, because, by chance, the inbreeding
has not affected this particular gene. On the other hand, if the alleles are
autozygous, then the chosen organism must be homozygous, and the probability
of homozygosity for any particular allele is simply the frequency of
the allele in the subpopulation as a whole. (Because the alleles in question are
autozygous, knowing which allele is present in one chromosome immediately
tells you that an identical allele is in the homologous chromosome.) These
considerations hold regardless of the number of alleles but, to simplify
matters, suppose there are only two alleles A and a at frequencies p and q (with
p + q = 1). The probability that an organism has genotype AA is therefore
p^2(1 - F) + pF. In this expression, the first term refers to cases in which the
alleles are allozygous and the second to cases in which the alleles are
autozygous. Similarly, the probability that an organism has genotype aa is
q^2(1 - F) + qF. Heterozygous Aa genotypes then have the frequency 2pq(1 - F)
since alleles that are heterozygous must be allozygous.
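
The allozygous/autozygous argument above can be checked numerically. The
following Python sketch (the function name and the example values of p and F are
assumptions made for illustration) builds the three genotype frequencies from
the two cases and confirms that they sum to 1, that the AA frequency equals the
algebraically equivalent form p^2 + pqF, and that F = 0 and F = 1 give the
random-mating and completely inbred extremes.

```python
from fractions import Fraction


def genotype_frequencies(p, F):
    """Genotype frequencies for alleles A (frequency p) and a (frequency 1 - p)
    in organisms with inbreeding coefficient F."""
    q = 1 - p
    return {
        "AA": p * p * (1 - F) + p * F,  # allozygous term + autozygous term
        "Aa": 2 * p * q * (1 - F),      # heterozygotes must be allozygous
        "aa": q * q * (1 - F) + q * F,
    }


# Hypothetical allele frequency and inbreeding coefficient.
p, F = Fraction(7, 10), Fraction(1, 4)
freqs = genotype_frequencies(p, F)

assert sum(freqs.values()) == 1
assert freqs["AA"] == p**2 + p * (1 - p) * F  # same value, rearranged

print(freqs)
print(genotype_frequencies(p, 0))  # Hardy-Weinberg proportions
print(genotype_frequencies(p, 1))  # only homozygotes, in the proportion p : q
```

Exact fractions are used so that the equalities can be asserted without
rounding error.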

The genotype frequencies with inbreeding are summarized graphically in
Figure 4.8. The box is divided vertically into two parts, corresponding to
genes whose alleles remain allozygous in spite of the inbreeding and those
whose alleles are autozygous because of the inbreeding. The division is in the
proportion 1 - F : F. Within the allozygous part of the box, the horizontal
panels correspond to the allozygous genotypes AA, Aa, and aa, which are in the
Hardy-Weinberg proportions. Within the autozygous part of the box, the
horizontal panels correspond to the autozygous genotypes AA and aa, which are
in the proportions p : q. The formulas for the genotype frequencies with
inbreeding are given in Table 4.3. Note that the genotype frequencies are
exactly the same as those given in Equation 4.10. This result shows that
the autozygosity definition of F and the heterozygosity definition of F,
though superficially quite different, are actually equivalent.

Corresponding to the probability interpretation of F_IS, there is also a
probability interpretation of F_ST. However, the comparison is not between
homologous alleles in the same organism but between homologous alleles drawn at
random from the same subpopulation. Specifically, F_ST is the probability of
IBD between two alleles drawn at random from the same subpopulation.
However, the inbreeding at this level is not realized as a departure from HWE
but rather as differences in allele frequency among the subpopulations
(Equation 4.6). The variance in allele frequency, in turn, results in a departure

[Figure 4.8 diagram: a box divided in the proportion 1 - F : F. One part is
labeled "Probability that a gene remains allozygous in spite of inbreeding"
(proportional to 1 - F); the other is labeled "Probability a gene becomes
autozygous because of inbreeding" (proportional to the amount of inbreeding, F).]

Figure 4.8 Graphical representation of the effects of inbreeding on genotype
frequencies. Some genes remain allozygous in spite of the inbreeding, and
among these the genotype frequencies of AA, Aa, and aa are given by the
Hardy-Weinberg principle. Other genes are autozygous because of the inbreeding,
and among these the genotype frequencies of AA and aa are given by the
allele frequencies. There are no heterozygotes in the autozygous case because
the two alleles present at an autozygous locus are, by definition, identical by
descent.

TABLE 4.3

                           Frequency in Population
          -------------------------------------------------------------------
          With inbreeding coefficient F      With F = 0       With F = 1
Genotype  (allozygous + autozygous genes)    (random mating)  (complete inbreeding)

AA        p^2(1 - F) + pF                    p^2              p
Aa        2pq(1 - F)                         2pq              0
aa        q^2(1 - F) + qF                    q^2              q




from HWE in the genotype frequencies when averaged across subpopulations
(Equation 4.7). The probability interpretation of F_ST makes the meaning
of Equation 4.14 transparent. It says that, in the total population, a pair of
alleles will escape being IBD (1 - F_IT) only if they escape the effects of mating
between close relatives (1 - F_IS) and, independently, if they escape the
cumulative inbreeding effects of mating between remote relatives due to
population substructure (1 - F_ST).

Genetic Effects of Inbreeding

In outcrossing species, which means species that regularly avoid inbreeding,
close inbreeding is generally harmful. The effects are seen most dramatical- 
ly when inbreeding is complete or nearly complete. Although nearly com- 
plete autozygosity can be approached in most species by many generations 
of brother-sister mating, autozygosity of entire chromosomes can easily be 
accomplished in Drosophila by the sort of mating scheme shown in Figure 
4.9. In this diagram, Cy (Curly wings) and Pm (Plum-colored eyes) are domi- 
nant mutations present in certain laboratory second chromosomes that carry 
several long inversions to prevent recombination. In step A, a wildtype fly is 
mated with Cy/Pm; four genotypes of offspring are produced because the 
wildtype fly is heterozygous for two different wildtype chromosomes. From 
each cross in A, a single Cy son is chosen and mated with Cy/Pm. This step 
is shown in part B. Three classes of progeny are produced (because Cy/Cy 
is lethal); moreover, from each mating the Cy/+ progeny all carry wildtype
second chromosomes that are IBD because they originated by replication of
a single chromosome in the previous generation. In the cross in part C, the
Cy/+ progeny from part B are mated among themselves; the expected
progeny are +/+ and Cy/+ in the ratio 1/3 : 2/3, and the wildtype
homozygotes have second chromosomes that are IBD. For chromosome 2, these flies
are completely inbred. In the mating D, Cy/+ flies carrying two different
wildtype chromosomes are crossed; again the expected progeny are +/+ and
Cy/+ in the ratio 1/3 : 2/3, but in this case the wildtype flies are heterozygous
for different copies of chromosome 2 and are not completely inbred. 

For the matings in part C and part D, an estimate v of the viability
(ability to survive) of the +/+ genotype, relative to that of the Cy/+ genotype,
is given by

\[ v = \frac{2 \times \text{Number}(+/+)}{1 + \text{Number}(Cy/+)} \qquad (4.15) \]

where Number (+/+) and Number (Cy/+) are the counts of wildtype and
Curly offspring, respectively (Haldane 1956). The addition of 1 to the
denominator makes the estimate of v almost unbiased. When the total number of
offspring is large, v is essentially equal to two times the number of wildtype
offspring divided by the number of Curly offspring.
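
As a small illustration of the viability estimate just described, the following
Python sketch computes v from offspring counts. The function name and the counts
are hypothetical; the counts were chosen so that wildtype and Curly offspring
appear in the 1 : 2 ratio expected when the two genotypes are equally viable.

```python
def relative_viability(n_wildtype, n_curly):
    """Estimate the viability of +/+ relative to Cy/+ from offspring counts.

    The factor 2 corrects for the expected 1/3 : 2/3 ratio of +/+ to Cy/+
    offspring; the 1 added to the denominator is the bias correction.
    """
    return 2 * n_wildtype / (n_curly + 1)


# Hypothetical counts, for illustration only.
print(relative_viability(100, 199))  # equally viable genotypes
print(relative_viability(0, 300))    # a lethal chromosome: no +/+ survivors
```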

Results of an experiment using the procedure in Figure 4.9 are shown in
Figure 4.10. It is evident that the homozygous genotypes (shaded histogram) 
are relatively poor in viability. In fact, about 37% of the homozygotes are 
lethal. Moreover, among the homozygotes that have viabilities within the 
normal range of heterozygotes (open histogram), virtually all can be shown 
to have reduced fertility (Sved 1975; Simmons and Crow 1977). Inbreeding so 
close as to make entire chromosomes homozygous is rare in outcrossing 
species, except in the kind of experiment in Figure 4.9, but the effects are 
clearly very harmful and provide a new dimension of genetic diversity. In the
case of allozymes, genetic diversity results from common alleles that do not
perceptibly impair viability or fertility when homozygous. In the case of
inbreeding, the effects are mainly due to rare alleles that are severely detri- 
mental when homozygous. (The fact that the alleles are rare is shown by the 
small proportion of lethal or near-lethal heterozygotes.) Figure 4.10 shows 
that natural populations of Drosophila contain considerable hidden genetic 
variation in the form of rare deleterious recessive alleles. 

Detrimental effects of inbreeding, called inbreeding depression, are
found in virtually all outcrossing species, and the more intense the
inbreeding, the more harmful the effects. Inbreeding in human beings is 
also generally harmful, but the effect is difficult to measure because the 
degree of inbreeding is less than that in experimental organisms; the 
effects may also vary from population to population. Nevertheless, chil- 
dren of first-cousin matings are, on the average, less capable than nonin- 
bred children in any number of ways (for example, higher rate of 
mortality, lower IQ scores), although it should be emphasized that many
such children are within the normal range of abilities and some are quite
gifted. As in most organisms, inbreeding depression is largely due to the

Figure 4.9 Mating scheme to extract wildtype chromosomes (in this case, the
second chromosome) from populations of Drosophila melanogaster. Cy (Curly wings)
and Pm (Plum eye color) are dominant mutations contained in certain special
laboratory chromosomes that have multiple inversions to prevent recombination.
From each mating of the type in part A, a single Cy son (containing one wildtype
second chromosome) is selected. This son is backcrossed (part B) in order to
reproduce many replicas of the second chromosome; the Cy progeny are selected
for further mating, and the other progeny are discarded. Brother-sister mating
as in part C is expected to produce 1/4 Cy/Cy, 1/2 Cy/+, and 1/4 +/+ zygotes
(where + denotes the wildtype second chromosome); the Cy/Cy zygotes do not
survive, and so the surviving offspring are 2/3 Cy/+ (Curly wings) and 1/3 +/+
(wildtype straight wings). Matings as in part D, between a female containing one
wildtype second chromosome and a male carrying a different one, are also
expected to produce 2/3 Curly-winged and 1/3 straight-winged progeny. However,
in mating C, the straight-winged flies are homozygous for a single wildtype
second chromosome, whereas in mating D, the straight-winged flies are
heterozygous for two different wildtype second chromosomes.

[Figure 4.9 diagram, four panels:
(A) Mate and select a single Curly-winged son.
(B) Backcross a single Cy male from (A) and select Curly sons and daughters,
    which are heterozygous.
(C) Mate heterozygotes for the same wildtype chromosome and count the proportion
    of non-Curly offspring. Expect 1/3 non-Cy : 2/3 Cy.
(D) Mate heterozygotes for different chromosomes and count the proportion
    of non-Curly offspring. Expect 1/3 non-Cy : 2/3 Cy.]

[Figure 4.10 histogram. Horizontal axis: viability (relative to Cy/+).]

Figure 4.10 Viability distributions of wildtype homozygotes (shaded area)
and wildtype heterozygotes (black outline) of second chromosomes extracted
from Drosophila melanogaster according to the mating scheme in Figure 4.9. The
histograms depict results of testing 691 homozygous combinations and 688
heterozygous combinations. Note that, in this sample, nearly 37% of the wildtype
chromosomes are lethal when homozygous, and many more have viabilities
substantially below normal. (Data from Mukai et al. 1974.)

increased homozygosity of rare recessive alleles, and so inbreeding effects
in human beings are seen most dramatically in the increased frequency of
genetic abnormalities due to harmful recessive alleles among the children
of first-cousin matings. The increased frequency of such conditions results
from the genotype frequencies given in Table 4.3. If a denotes a rare
deleterious recessive allele, then, among the children of first-cousin matings,
the frequency of aa is q^2(1 - 1/16) + q(1/16) because, for these children,
F = 1/16, as will be shown in the next section. On the other hand, with random
mating, the frequency of recessive homozygotes is q^2. Thus, the risk
of an affected offspring from a first-cousin mating relative to that from a
mating of nonrelatives is given by

\[ \frac{q^2(1 - 1/16) + q(1/16)}{q^2} = 0.9375 + \frac{0.0625}{q} \]

For example, when q = 0.01, the increased risk is approximately 7; that is,
a first-cousin mating has seven times the chance of producing a homozygous
recessive child as compared to a mating between nonrelatives when the
frequency of the harmful recessive allele is 0.01. There is clearly a dramatic
inbreeding effect, and the rarer the deleterious recessive allele, the greater
the effect.

PROBLEM 4.8 Relative to the risk with random mating, calculate
the risk of a homozygous recessive offspring from a mating of second
cousins (F = 1/64) when the recessive allele frequency is q = 0.01.

ANSWER In general, the relative risk is given by [q^2(1 - F) + qF]/q^2
= (1 - F) + F/q. For F = 1/64, this becomes 0.9844 + 0.0156/q, and the
value for q = 0.01 is approximately 2.5.
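
The relative-risk expression (1 - F) + F/q is easy to evaluate for other
combinations of F and q. The following Python sketch (the function name is an
assumption made for illustration) reproduces the first-cousin and second-cousin
values discussed above and shows how the relative risk grows as the allele
becomes rarer.

```python
from fractions import Fraction


def relative_risk(q, F):
    """Risk of a homozygous recessive offspring with inbreeding coefficient F,
    relative to the risk with random mating, for allele frequency q."""
    return (1 - F) + F / q


q = Fraction(1, 100)
first_cousins = relative_risk(q, Fraction(1, 16))
second_cousins = relative_risk(q, Fraction(1, 64))
print(float(first_cousins))   # about 7
print(float(second_cousins))  # about 2.5

# The rarer the allele, the larger the relative risk for first cousins.
for q in (Fraction(1, 10), Fraction(1, 100), Fraction(1, 1000)):
    print(float(q), float(relative_risk(q, Fraction(1, 16))))
```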

Calculation of the Inbreeding Coefficient from Pedigrees 

Computation of F from a pedigree is simplified by drawing the pedigree in 
the form shown in Figure 4.11A, where the lines represent gametes con- 
tributed by parents to their offspring. The same pedigree is shown in con- 
ventional form in Figure 4.11B. The organisms in gray in part B are not rep- 
resented in part A because they have no ancestors in common and therefore 
do not contribute to the inbreeding of the organism denoted I. The inbreeding
coefficient F_I of I is the probability that I is autozygous for the alleles of




Figure 4.11 (A) Convenient way to represent pedigrees for calculation of the
inbreeding coefficient. In this case, the pedigree shows a mating between
half-first cousins. (B) Conventional representation of the same pedigree as in
part A. Squares represent males, circles represent females, and the shaded
organisms in part B are not depicted in part A because they do not contribute to
the inbreeding of the inbred organism designated I.

an autosomal gene under consideration. The first step in calculating F_I is to
locate all the common ancestors in the pedigree, because an allele could
become autozygous in I only if it were inherited through both of I's parents
from a common ancestor; in this case, there is only one common ancestor,
namely, A. The next step in calculating F_I, which is carried out for each
common ancestor in turn, is to trace all the paths of gametes that lead from one
of I's parents back to the common ancestor and then down again to the other
parent of I. These paths are the paths along which an allele in a common
ancestor could become autozygous in I. In Figure 4.11A, there is only one
such path: DBACE, in which the common ancestor, A, is underlined for
bookkeeping purposes, an especially useful procedure in complex pedigrees.

The third step in calculating F_I is to calculate the probability of
autozygosity in I due to each of the paths in turn. For the path DBACE, the
reasoning is illustrated in Figure 4.12. Here the black dots represent alleles
transmitted along the gametic paths, and the number associated with each
step is the probability of identity by descent of the alleles indicated. For all
steps except that around the common ancestor, the probability is 1/2 because,
with Mendelian segregation, the probability that a particular allele present in
a parent is transmitted to a specified offspring is 1/2. To understand why
(1/2)(1 + F_A) is the probability associated with the loop around the common
ancestor, denote the alleles in the common ancestor as a_1 and a_2. These
symbols are used to avoid confusion with conventional allele symbols designating
functional types of alleles, such as A for dominant and a for recessive.
The pair of gametes contributed by A could contain a_1a_1, a_2a_2, a_1a_2, or
a_2a_1, each with a probability of 1/4 because of Mendelian segregation. In the
first two cases, the alleles are clearly identical by descent; in the second two
cases,

[Figure 4.12 diagram: the gametic path through the common ancestor A; the loop
around A is labeled 1/2(1 + F_A).]

Figure 4.12 Loops for the pedigree in Figure 4.11A, showing probabilities that
designated alleles (solid dots) are identical by descent. Each loop is
independent of the others, so their probabilities multiply; thus, the inbreeding
coefficient of organism I is F_I = (1/2)^5(1 + F_A), where F_A represents the
inbreeding coefficient of the common ancestor.

the alleles are identical by descent only if a_1 and a_2 are already identical by
descent, which means that A is autozygous. The probability that A is autozygous
is, by definition, the inbreeding coefficient of A, F_A. Hence, the probability
for the step around the common ancestor A is 1/4 + 1/4 + (1/4)F_A + (1/4)F_A =
1/2 + (1/2)F_A = (1/2)(1 + F_A). Because each of the steps in Figure 4.12 is
independent of the others, the total probability of autozygosity in I due to the
path through A is 1/2 x 1/2 x (1/2)(1 + F_A) x 1/2 x 1/2, or (1/2)^5(1 + F_A).
Note that the exponent on the 1/2 is simply the total number of ancestors in the
path. In general, if a path through a common ancestor A contains i individuals,
the probability of autozygosity due to that path is

\[ \left(\tfrac{1}{2}\right)^{i} (1 + F_A) \]

Thus, the inbreeding coefficient of I in Figure 4.11A is (1/2)^5(1 + F_A).
Assuming that A is not inbred (F_A = 0), the inbreeding coefficient of I reduces
to (1/2)^5 = 1/32.

In pedigrees of greater complexity, there is more than one common ances- 
tor and there may be more than one path through any of the common ances- 
tors. The paths are mutually exclusive because autozygosity due to an allele 
inherited along one path excludes autozygosity due to an allele inherited 
along a different path. Thus, the total inbreeding coefficient is the sum of the 
probabilities of autozygosity due to each path considered separately. The
whole procedure for calculating F is summarized in an example of a first- 
cousin mating in Figure 4.13. In a first-cousin mating, there are two common 


[Figure 4.13 diagram: pedigree of I and the two paths through the common
ancestors. Legible labels: "Paths: GDACE"; "Contribution to F_I:
(1/2)^5(1 + F_A)".]

Figure 4.13 On the left is a pedigree of individual I, the offspring of a
first-cousin mating. On the right are the two paths through common ancestors
(heavy lines) used in calculating the inbreeding coefficient of I. Below each
path is the contribution to F_I due to that path, calculated as in Figure 4.12.
Each path is mutually exclusive of the others, and so their probabilities add;
thus, the total inbreeding coefficient of I is the sum of the two separate
contributions. If F_A = F_B = 0, then F_I = 1/16.


ancestors (A and B) and two paths (one each through A and B). The total
inbreeding coefficient of I is the sum of the two separate contributions shown
in Figure 4.13. If A and B are both noninbred, then F_A = F_B = 0, and so F_I =
(1/2)^5 + (1/2)^5 = 1/16; this result is the probability that I is autozygous at
the specified locus. Alternatively, F_I can be interpreted as the average
proportion of all genes in I in which the alleles present are autozygous.

In general, for any autosomal gene, the formula for calculating the
inbreeding coefficient F_I of an inbred organism I is

\[ F_I = \sum_{A} \left(\tfrac{1}{2}\right)^{i} (1 + F_A) \qquad (4.17) \]

in which the summation over A means summation over all possible paths
through all common ancestors, i is the number of organisms in each path,
and A is the common ancestor in each path.
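
The path-counting formula can be applied mechanically once the paths have been
listed by hand. In the following Python sketch, each path is given as a pair
(i, F_A): the number of organisms in the path and the inbreeding coefficient of
its common ancestor. The function name and this input format are assumptions
made for illustration; finding the paths in a pedigree is still left to the
reader.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def inbreeding_from_paths(paths):
    """Sum (1/2)^i * (1 + F_A) over paths given as (i, F_A) pairs."""
    return sum((HALF**i * (1 + f_a) for i, f_a in paths), Fraction(0))


# Half-first cousins: one path of five organisms, noninbred ancestor.
print(inbreeding_from_paths([(5, 0)]))          # 1/32

# First cousins: two paths of five organisms, noninbred ancestors.
print(inbreeding_from_paths([(5, 0), (5, 0)]))  # 1/16

# Same half-first-cousin path, but with a hypothetical inbred ancestor.
print(inbreeding_from_paths([(5, Fraction(1, 4))]))
```

Because the paths are mutually exclusive, their contributions are simply added;
an inbred common ancestor raises the contribution of every path that passes
through it.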

PROBLEM 4.9 The accompanying pedigree depicts two generations 
of brother-sister mating. Calculate the inbreeding coefficient of I, 
assuming that none of the common ancestors is inbred. (Altogether, 
there are four common ancestors and six paths.) 

ANSWER F_I = (1/2)^3(1 + F_C) + (1/2)^3(1 + F_D) + (1/2)^5(1 + F_A) +
(1/2)^5(1 + F_A) + (1/2)^5(1 + F_B) + (1/2)^5(1 + F_B). When the common
ancestors are assumed to be noninbred, then F_A = F_B = F_C = F_D = 0, and so
F_I = 3/8.
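
The answer to Problem 4.9 can be cross-checked without listing paths. The
following Python sketch uses a different technique from the path counting in the
text: a recursive calculation of the probability that two alleles, one drawn
from each of two organisms, are identical by descent. The pedigree labels are
assumptions (the accompanying pedigree is not reproduced here): A and B are
noninbred, unrelated founders; C and D are their offspring; E and F are the
offspring of C and D; and I is the offspring of E and F.

```python
from fractions import Fraction

# Hypothetical labels: A x B -> C, D; C x D -> E, F; E x F -> I.
PARENTS = {
    "C": ("A", "B"), "D": ("A", "B"),
    "E": ("C", "D"), "F": ("C", "D"),
    "I": ("E", "F"),
}
# Organisms listed so that parents always precede their offspring.
BIRTH_ORDER = ["A", "B", "C", "D", "E", "F", "I"]
HALF = Fraction(1, 2)


def kinship(x, y):
    """Probability that one allele drawn from x and one from y are IBD."""
    if BIRTH_ORDER.index(x) < BIRTH_ORDER.index(y):
        x, y = y, x  # make x the later-born of the pair
    if x not in PARENTS:
        # x is a founder: noninbred, and unrelated to any earlier organism.
        return HALF if x == y else Fraction(0)
    parent_1, parent_2 = PARENTS[x]
    if x == y:
        return HALF * (1 + kinship(parent_1, parent_2))
    return HALF * (kinship(parent_1, y) + kinship(parent_2, y))


def inbreeding(organism):
    """Inbreeding coefficient: the kinship of the organism's two parents."""
    parent_1, parent_2 = PARENTS[organism]
    return kinship(parent_1, parent_2)


print(inbreeding("E"))  # one generation of brother-sister mating
print(inbreeding("I"))  # two generations of brother-sister mating
```

Under these assumptions the recursion gives 1/4 after one generation of
brother-sister mating and 3/8 after two, in agreement with the sum over the six
paths.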


Regular Systems of Mating 

In plant and animal breeding, it is often important to know how rapidly the 
inbreeding coefficient increases when a strain is propagated by a regular sys- 
tem of mating, such as repeated self-fertilization, sib mating, or backcrossing 
to a standard strain. The reasoning involved in calculating the inbreeding 
coefficient for any generation is illustrated in Figure 4.14 for repeated
self-fertilization. In this figure, the labels t - 1 and t refer to the inbred
organisms after t - 1 and t generations of self-fertilization. The loop around
the ancestor in generation t - 1 designates the probability that the two
indicated alleles are identical by descent. Here the formula in Equation 4.17
applies with only one path and only one ancestor in the path, and so
F_t = (1/2)(1 + F_{t-1}), where F_t is the inbreeding coefficient in generation
t. This equation is easy to solve in terms of the quantity 1 - F_t, which is
often called the panmictic index, panmixia being a synonym for random mating.
Multiplying both sides of the equation for F_t by -1 and then adding +1 to each
side leads to 1 - F_t = 1 - (1/2)(1 + F_{t-1}) = 1 - 1/2 - (1/2)F_{t-1} =
(1/2)(1 - F_{t-1}), or

\[ 1 - F_t = \left(\tfrac{1}{2}\right)^{t} (1 - F_0) \]

where F_0 is the inbreeding coefficient in the initial generation when the
repeated self-fertilization begins. Self-fertilization therefore leads to an
extremely rapid increase in the inbreeding coefficient. When F_0 = 0, then
F_1 = 1/2, F_2 = 3/4, F_3 = 7/8, F_4 = 15/16, and so on. The increase in F under
self-fertilization and several other regular systems of mating is shown in
Figure 4.15.
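
The recursion for repeated self-fertilization can be iterated directly and
compared with its closed-form solution for the panmictic index. In the following
Python sketch the function names are assumptions made for illustration; exact
fractions are used so that the two calculations can be compared without rounding.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def selfing_series(f0, generations):
    """Iterate F_t = (1/2)(1 + F_{t-1}); return [F_0, F_1, ..., F_generations]."""
    series = [Fraction(f0)]
    for _ in range(generations):
        series.append(HALF * (1 + series[-1]))
    return series


def selfing_closed_form(f0, t):
    """Solve 1 - F_t = (1/2)^t (1 - F_0) for F_t."""
    return 1 - HALF**t * (1 - Fraction(f0))


series = selfing_series(0, 4)
print(series)  # F_0 through F_4, starting from a noninbred generation

# The iterated values agree with the closed form in every generation.
assert all(f == selfing_closed_form(0, t) for t, f in enumerate(series))
```

The panmictic index is halved in each generation, and so F approaches 1
geometrically whatever the starting value F_0.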

Many plants reproduce predominantly by self-fertilization, including 
crop plants such as soybeans, sorghum, barley, and wheat. As expected of 

[Figure 4.14 diagram: the loop around the ancestor in generation t - 1 is labeled
1/2(1 + F_{t-1}).]

Figure 4.14 Increase in F resulting from continued self-fertilization. The
organism in generation t is the offspring of self-fertilization of the organism
in generation t - 1. The loop shows that F_t = 1/2(1 + F_{t-1}).

[Figure 4.15 plot. Horizontal axis: generations (t). One curve is labeled
"Repeated backcrossing to inbred strain".]

Figure 4.15 Theoretical increase in the inbreeding coefficient F for regular
systems of mating: selfing, sib mating, half-sib mating, and repeated
backcrossing to a single organism from a random-bred strain. In each case, the
initial value of F is assumed to be F_0 = 0.

highly self-fertilizing species, each plant is highly homozygous for alleles 
such as those determining allozymes. Yet the proportion of polymorphic 
genes is comparable to that found in outcrossing species. Polymorphisms are 
found because self-fertilization does not eliminate genetic variation; it simply 
reorganizes genetic variation into homozygous genotypes. On the other 
hand, self-fertilizing species do contain fewer deleterious recessives than do 
outcrossing species, presumably because the increased homozygosity per- 
mits harmful recessives to be eliminated from the population by natural 
selection. One other important point about naturally self-fertilizing species: 
The high homozygosity of all genes implies that recombination rarely results 
in new gametic types not already present in the parent. Therefore, predomi- 
nance of selfing has the effect of retarding the approach to linkage equilibri- 
um because the approach to linkage equilibrium is through recombination 
in double heterozygotes (AB/ab and Ab/aB in the case of two alleles at each 
locus); with extreme inbreeding, such double heterozygotes are rare. Indeed, 
the most extreme examples of linkage disequilibrium have been found in
predominantly self-fertilizing species such as barley (Hordeum vulgare) and wild
oats (Avena barbata). 

Barley, which regularly undergoes more than 99% self-fertilization, provides
an extreme example of linkage disequilibrium between two unlinked
esterase genes (Clegg et al. 1972). A population that had originated as a
complex cross was maintained for 26 generations under normal agricultural
conditions without conscious selection. The population was polymorphic for
two alleles B_1 and B_2 of an Esterase-B gene and also polymorphic for two
alleles D_1 and D_2 of an Esterase-D gene. The gametic types were found in the
following proportions. For all practical purposes, these numbers also refer to
homozygous genotypes because there is such close inbreeding.

B_1D_1    1501 (1642.6)
B_1D_2     754 (613.7)
B_2D_1    [counts not legible in the source]
B_2D_2    [counts not legible in the source]

(The numbers in parentheses are the expected numbers based on the
assumption of linkage equilibrium, calculated as in Chapter 3.) The chi-square
value in this case is 172.7 with one degree of freedom. The associated
probability is much less than 0.0001, and so there is undoubtedly linkage
disequilibrium. For the above data, the linkage disequilibrium parameter
(Equation 3.9) is D = -0.046, which is about 66% of its theoretical minimum.

One of the dramatic successes of plant breeding has come from the crossing 
of inbred lines to produce high-yielding hybrid corn. Yield of a genetically 
heterogeneous, outcrossing variety of corn can be improved by selecting the 
plants with the highest yields in each generation to be the progenitors of the 
next generation; such artificial selection results in only gradual improvement, 
however (see Chapter 9). If a large number of self-fertilized lines are
established from a heterogeneous population, each line declines in yield as
inbreeding proceeds, owing to the forced homozygosity of deleterious recessives.
Many lines become so inferior that they have to be discontinued. Self-fertilized
lines are not likely to become homozygous for exactly the same set of
deleterious recessives, however, and when different lines are crossed to produce
a hybrid, the hybrid becomes heterozygous for these genes. Alleles favoring high
yield in corn are generally dominant, and there may also be genes in which the
heterozygous genotypes have a more favorable effect on yield than do the
homozygous genotypes; in any case, the hybrid has a much higher yield than
either inbred parent. The phenomenon of enhanced hybrid performance is
called hybrid vigor or heterosis. In practice, inbred lines are crossed in many
combinations to identify those that produce the best hybrids. Yields of hybrid 
corn are typically 15 to 35% greater than yields of outcrossing varieties, and the 
successful introduction of hybrid corn has been remarkable. Virtually all corn 
acreage in the United States today is planted with hybrids, as compared to 
4% of the acreage in 1933 (Sprague 1978). 


When choice of mates is based on phenotypes, mating is said to be
assortative. Most assortative mating is positive assortative mating; this term means
that mating pairs have, on the average, more similar phenotypes than would be
expected with random mating. The qualifier "on the average" is important. Even
when mating is random, some mating pairs are phenotypically similar, and
so positive assortative mating refers only to those situations in which mating
pairs are phenotypically more similar than would be expected by chance.

There are also examples of negative assortative mating, sometimes called
disassortative mating, in which mating pairs are more dissimilar than
expected by chance. One case of negative assortative mating is a polymorphism
known as heterostyly, found in most species of primroses (Primula) and their
relatives. The heterostyly polymorphism refers to the relative lengths of the
styles and stamens in the flowers (Figure 4.16). (In botanical terminology, the
style is a stalk bearing the stigma, which is the female organ that receives
pollen; the stamen is the male organ bearing anthers, in which the pollen is
produced.) Most populations of primroses contain approximately equal
proportions of two types of flowers, one known as pin, which has a tall style and
short stamens, and the other known as thrum, which has a short style and tall
stamens. In heterostyly, insect pollinators that work high on the flowers pick
up mostly thrum pollen and deposit it on pin stigmas, whereas pollinators
that work low in the flowers pick up mostly pin pollen and deposit it on
thrum stigmas. Negative assortative mating therefore takes place because
pins mate preferentially with thrums. Additional floral adaptations facilitate
the negative assortative mating. For example, pollen grains from pin flowers
fit the receptor cells of thrum stigmas better than they do their own, and
pollen grains from thrum flowers germinate better on pin stigmas than they
do on their own.

Figure 4.16 Diagrams of cross sections of (A) pin and (B) thrum flowers of the
primrose Primula. The pin flowers have a long style and short stamens; the
thrum flowers have a short style and long stamens. The differences in flower
morphology assist in the maintenance of negative assortative mating mediated
by insect pollinators.

The pollination biology of flowering plants also provides examples of 
positive assortative mating. For example, when the length of time in which 
any plant flowers is short relative to the total duration of the flowering sea- 
son, then plants that flower early in the season are preferentially pollinated 
by other early flowering plants, and those that flower late are preferentially 
pollinated by other late flowering ones. Thus, there is positive assortative 
mating for flowering time. 

In human beings, positive assortative mating is observed for height, IQ 
score, and certain other traits, although assortative mating varies in degree in 
different populations and is absent in some. As might be expected, positive 
assortative mating is found for certain socioeconomic variables. In one study 
in the United States, the highest correlation found between married couples 
was in the number of rooms in their parents' homes. Negative assortative 
mating is apparently quite rare in human populations. 

In certain species of Drosophila, a curious type of nonrandom mating is a
phenomenon called minority male mating advantage, in which females mate
preferentially with males with rare phenotypes. For example, in a study of
experimental populations of D. pseudoobscura containing flies homozygous
for either a recessive orange eye-color mutation or a recessive purple eye-color 
mutation, Ehrman (1970) found that, when 20% of the males were orange, the 
orange-eyed males participated in 30% of the observed matings; conversely 
when 20% of the males were purple, the purple-eyed males participated in 
40% of the observed matings. 

The consequences of positive assortative mating are complex. They 
depend on the number of genes that influence the trait in question, on the 
number of different possible alleles of the genes, on the number of different 
phenotypes, on the sex performing the mate selection, and on the criteria for 
mate selection. Traits for which mating is assortative are rarely determined 
by the alleles of a single gene, however. Most such traits are polygenic, so
reasonably realistic models of assortative mating tend to be rather complex.
Here we should note one obvious, qualitative consequence of positive
assortative mating: since like phenotypes tend to mate, assortative mating
generally increases the frequency of homozygous genotypes in the population at
the expense of heterozygous genotypes, and thus the phenotypic variance in
the population increases. (Negative assortative mating generally has the
opposite effect.) 



Species that are spread over a large geographical area are usually divided 
into subpopulations. Matings between organisms within the same
subpopulation are more likely than matings between organisms in different
subpopulations. Geographical subdivision of a population is called population
substructure. The genetic consequences of population substructure result from
the fact that the frequencies of alleles may differ from one subpopulation to
the next. When the allele frequencies differ, the average heterozygosity
among the subpopulations is smaller than that expected with random
mating in the total population. Many populations are subdivided into groups
within larger groups, a kind of structure called a hierarchical population
structure. The F statistics are a quantitative measure of the reduction in
heterozygosity at various levels in a population hierarchy. For example, $F_{SR}$ is
the proportionate reduction in average heterozygosity among
subpopulations (S) as compared to that expected with HWE within regions (R):
$F_{SR} = (H_R - H_S)/H_R$. Similarly, $F_{RT}$ is the proportionate reduction in average
heterozygosity among regions (R) as compared to that expected with HWE
in the total population (T): $F_{RT} = (H_T - H_R)/H_T$. The fixation index $F_{ST}$
combines the effects due to subdivision into subpopulations within regions and
regions within the total population: $F_{ST} = (H_T - H_S)/H_T$. Generally speaking,
an F statistic with a value smaller than 0.05 indicates little genetic
differentiation, a value from 0.05 to 0.15 indicates moderate genetic differentiation,
from 0.15 to 0.25 indicates great genetic differentiation, and above 0.25
indicates very great genetic differentiation among subpopulations.
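
As a rough illustration of how the three hierarchical F statistics fit together, the sketch below computes them from heterozygosities at the subpopulation, region, and total levels. The function names and the three heterozygosity values are hypothetical; the qualitative labels simply restate the rule of thumb given in the text.

```python
def f_statistics(h_s, h_r, h_t):
    """Return (F_SR, F_RT, F_ST) from heterozygosities at three levels."""
    f_sr = (h_r - h_s) / h_r   # subpopulations within regions
    f_rt = (h_t - h_r) / h_t   # regions within the total population
    f_st = (h_t - h_s) / h_t   # subpopulations within the total population
    return f_sr, f_rt, f_st


def differentiation(f):
    """Qualitative label following the rule of thumb quoted in the text."""
    if f < 0.05:
        return "little"
    if f <= 0.15:
        return "moderate"
    if f <= 0.25:
        return "great"
    return "very great"


# Hypothetical heterozygosities, for illustration only.
f_sr, f_rt, f_st = f_statistics(h_s=0.32, h_r=0.36, h_t=0.40)
print(f_sr, f_rt, f_st, differentiation(f_st))

# The levels combine multiplicatively: (1 - F_ST) = (1 - F_SR)(1 - F_RT).
print(1 - f_st, (1 - f_sr) * (1 - f_rt))
```

The multiplicative identity follows directly from the definitions, because $1 - F_{SR} = H_S/H_R$ and $1 - F_{RT} = H_R/H_T$.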

When subpopulations undergo fusion and random mating, the deficien- 
cy of heterozygotes is eliminated. Said another way around, the excess of 
homozygous genotypes in a subdivided population is eliminated by popu- 
lation fusion and random mating. This effect of population fusion is called 
the Wahlund principle. Quantitatively, the Wahlund principle implies that 
population fusion and random mating will cause a reduction in the frequen- 
cy of any homozygous genotype by an amount equal to the variance in allele 
frequency among the original subpopulations. For two alleles, the Wahlund
effect is related to the fixation index by the relation $F_{ST} = \sigma^2/(\bar{p}\,\bar{q})$. In terms
of the fixation index, the average genotype frequencies across
subpopulations are: AA with average frequency $\bar{p}^2(1 - F_{ST}) + \bar{p}F_{ST}$, Aa with average
frequency $2\bar{p}\bar{q}(1 - F_{ST})$, and aa with average frequency $\bar{q}^2(1 - F_{ST}) + \bar{q}F_{ST}$.
Despite the departure from HWE when genotype frequencies are averaged
across subpopulations, within each subpopulation mating is random and
the genotype frequencies are in HWE for the allele frequencies in the
subpopulation.
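
The Wahlund relations can be checked numerically. The sketch below assumes equal-sized subpopulations, each in HWE, and uses two hypothetical allele frequencies; the function name and the returned dictionary layout are illustrative choices rather than anything prescribed by the text.

```python
from statistics import mean, pvariance


def wahlund(p_list):
    """Compare averaged, predicted, and fused genotype frequencies.

    p_list holds the frequency of allele A in each subpopulation.
    Subpopulations are assumed equal in size and individually in HWE.
    """
    p_bar = mean(p_list)
    q_bar = 1 - p_bar
    variance = pvariance(p_list)              # variance in allele frequency
    f_st = variance / (p_bar * q_bar)
    averaged = {                              # averages over subpopulations
        "AA": mean([p * p for p in p_list]),
        "Aa": mean([2 * p * (1 - p) for p in p_list]),
        "aa": mean([(1 - p) ** 2 for p in p_list]),
    }
    predicted = {                             # formulas in terms of F_ST
        "AA": p_bar ** 2 * (1 - f_st) + p_bar * f_st,
        "Aa": 2 * p_bar * q_bar * (1 - f_st),
        "aa": q_bar ** 2 * (1 - f_st) + q_bar * f_st,
    }
    fused = {                                 # after fusion and random mating
        "AA": p_bar ** 2,
        "Aa": 2 * p_bar * q_bar,
        "aa": q_bar ** 2,
    }
    return f_st, variance, averaged, predicted, fused


# Hypothetical allele frequencies in two subpopulations.
f_st, variance, averaged, predicted, fused = wahlund([0.2, 0.8])
print(f_st, variance)
print(averaged["AA"] - fused["AA"])   # drop in AA frequency after fusion
```

In this example the reduction in the frequency of each homozygote after fusion equals the variance in allele frequency, as the Wahlund principle states.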

Inbreeding means mating between relatives. The most important effect 
of inbreeding is that replicas of a single allele in a common ancestor may be 
transmitted down both sides of the pedigree and come together in
fertilization to produce the inbred organism. In such a case, the inbred organism is
said to be autozygous, and the alleles are identical by descent (IBD).
Otherwise the inbred organism is allozygous. The inbreeding coefficient F is the
probability that the two homologous genes in an inbred organism are IBD.
With close inbreeding among parents with relatively recent common
ancestors, the value of F can be calculated from elementary probability
considerations using the formula $F = \sum (1/2)^i (1 + F_A)$, where the summation is over all
paths from one parent to the other through each common ancestor, i is the
number of organisms in the path, and $F_A$ is the inbreeding coefficient of the
common ancestor in the path. Among organisms in which the inbreeding
coefficient is F, the genotype frequencies of a gene with two alleles are, for
AA, $p^2(1 - F) + pF$; for Aa, $2pq(1 - F)$; and for aa, $q^2(1 - F) + qF$. Hence, one of
the most important consequences of close inbreeding is an increased risk of
homozygosity of rare recessive alleles: $q^2(1 - F) + qF$ for inbred organisms
versus $q^2$ for noninbred organisms. In human populations, a substantial
proportion of children affected with rare, homozygous recessive genetic diseases
have first-cousin parents, although first-cousin mating is infrequent. 
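
The path formula and the genotype frequencies under inbreeding can be sketched as follows. The example assumes the usual first-cousin pedigree (two shared, noninbred grandparents, each reached by a path of five organisms) and a hypothetical recessive allele frequency of q = 1/100; the function names are illustrative.

```python
from fractions import Fraction


def inbreeding_coefficient(paths):
    """Path formula: sum of (1/2)^i * (1 + F_A) over all paths.

    Each path is a pair (i, f_a): i is the number of organisms in the path
    and f_a is the inbreeding coefficient of its common ancestor.
    """
    return sum(Fraction(1, 2) ** i * (1 + f_a) for i, f_a in paths)


def genotype_frequencies(p, f):
    """Frequencies of AA, Aa, and aa among organisms with inbreeding coefficient f."""
    q = 1 - p
    return (p * p * (1 - f) + p * f,
            2 * p * q * (1 - f),
            q * q * (1 - f) + q * f)


# Offspring of first cousins: two common ancestors, five organisms per path.
f_cousins = inbreeding_coefficient([(5, 0), (5, 0)])
print(f_cousins)                                   # 1/16

# Hypothetical rare recessive allele with q = 1/100 (so p = 99/100).
p = Fraction(99, 100)
aa_inbred = genotype_frequencies(p, f_cousins)[2]
aa_random = genotype_frequencies(p, 0)[2]
print(float(aa_inbred), float(aa_random), float(aa_inbred / aa_random))
```

Exact fractions are used so that the path sum and the genotype frequencies can be compared without rounding error.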

Population substructure results in an accumulation of inbreeding because 
mating pairs within subpopulations will often have remote relatives in
common, even when mates are chosen at random. Thus, the inbreeding
coefficient F resulting from nonrandom mating within a subpopulation should be
designated $F_{IS}$. The total inbreeding resulting from nonrandom mating
combined with all levels of population substructure is given by the
expression $(1 - F_{IT}) = (1 - F_{IS}) \times (1 - F_{ST})$.


1. Two diploid random mating populations have allele frequencies q + e and 
q - e for a recessive allele of a gene. What are the frequencies of homozy- 
gotes before and after population fusion? 

2. Show that $F_{IT} = F_{IS} + F_{ST} - F_{IS}F_{ST}$ and interpret the expression.

3. Calculate $F_{ST}$ among the three random-mating populations below based
on the specified allele frequencies. What is the maximum value of $F_{ST}$ in
this situation?

             Population 1   Population 2   Population 3

Allele 1
Allele 2
Allele 3





4. Calculate $F_{IS}$, $F_{ST}$, and $F_{IT}$ for the populations with the genotype
frequencies shown in the following table:

               Population 1   Population 2

Genotype AA



5. Suppose two subpopulations with equal allele frequencies of two linked
genes have an amount of linkage disequilibrium that is equal but opposite
in sign. What is the amount of linkage disequilibrium in a population
formed by mixing equal numbers of individuals from the two populations?

6. Show that $p^2(1 - F) + pF = p^2 + pqF = p - (1 - F)pq$, when q = 1 - p.

7. With two alleles and p = 1/2, what are the expected genotype frequencies
in a random mating population and among the offspring of first cousins?
How great is the decrease in heterozygosity in the inbred population
relative to the random mating population?

8. If the frequency of an autosomal recessive disorder is 1/1600 among the
offspring of unrelated parents, what is the expected frequency among the
offspring of first cousins?

9. For a recessive allele at frequency q in a population in which one percent
of the matings are between first cousins, but otherwise occur at random,
the proportion of affected individuals having first-cousin parents is
$(1 + 15q)/(1 + 1599q)$. Calculate for q = 0.1, 0.05, 0.01, 0.005, and 0.001.
Interpret the result of the equation when q = 1.

10. In a population of monoecious plants in Hardy-Weinberg proportions for
two alleles with allele frequency p, what is the variance in allele
frequency among plants? What is the variance if the population were completely
inbred? If a random mating population were to undergo self-fertilization,
what would the variance be when the inbreeding coefficient equals F?

11. The measure of genetic divergence $G_{ST}$ is very useful for multiple alleles
in multiple subpopulations. $G_{ST}$ can be defined as $(J_S - J_T)/(1 - J_T)$, where
$p_i$ is the frequency of the ith allele, $J_S = \sum \mathrm{Avg}(p_i^2)$ and $J_T = \sum [\mathrm{Avg}(p_i)]^2$ (Nei
1987). The summation means summation over all alleles, and Avg means
the average over all subpopulations. For the random mating populations
below, calculate $F_{ST}$ and $G_{ST}$.

               Population 1   Population 2

Allele 1
Allele 2
Allele 3

12. $G_{ST}$ for multiple alleles is actually a weighted average of $F_{ST}$ values:
$G_{ST} = \sum \bar{p}_i(1 - \bar{p}_i)F_{ST(i)} / \sum \bar{p}_i(1 - \bar{p}_i)$, where the summation is over all alleles,
$\bar{p}_i$ is the average frequency of the ith allele among the subpopulations,
and $F_{ST(i)}$ is the $F_{ST}$ value for the ith allele calculated as if the gene had
only two alleles with frequencies $p_i$ and $1 - p_i$ in each subpopulation.
Calculate $F_{ST(i)}$ for each allele in the preceding problem and confirm
numerically that the weighted average equals $G_{ST}$.

13. In calculating F from pedigrees for X-linked genes, why are paths with 
two or more consecutive males not counted? 

14. What is the coefficient of relationship between I and J in the
accompanying pedigree, where I and J are the offspring of a pair of first cousins
(A, B) mated with another pair of first cousins (C, D)?

15. Assuming F A = F B = 0, calculate the inbreeding coefficient for each of the 
individuals C - / in the accompanying pedigree. 


16. If a population is maintained by self-fertilization in even-numbered gen- 
erations and by random mating in odd-numbered generations, what hap- 
pens to the inbreeding coefficient? 

17. For a gene with two alleles and p = 0.3, what are the expected genotype 
frequencies after five generations of sib mating? What are the expected
genotype frequencies after one additional generation of random mating?

18. What is the inbreeding coefficient in a population of size 50 that
undergoes

a. 47 generations of random mating followed by three generations of sib
mating?
b. 50 generations of random mating?

19. In gametophytic self-incompatible plants, the pollen can only fertilize
ovules whose genotype has neither allele borne by the haploid pollen. In
a plant population at equilibrium with three gametophytic
self-incompatibility alleles, what is the probability that a pollen grain will land on a

20. Two-way hybrid corn is produced by crossing two different inbred lines;
three-way hybrids are produced by crossing a two-way hybrid with an
unrelated inbred; and four-way hybrids are produced by crossing two
different two-way hybrids. What is the inbreeding coefficient of the
offspring of randomly mated two-way, three-way, or four-way hybrids?
(Hint: Consider the allele frequencies in gametes.)

21. Derive a recursion equation for $F_t$ for repeated parent-offspring mating
(see pedigree), and calculate $F_t$ for t = 0 to 5.


22. Derive a recursion equation for $F_t$ for repeated backcrossing to a single
noninbred individual A (see pedigree). Calculate $F_t$ for t = 0 to 5 and the
equilibrium value. 


Sources of Variation 

Mutation Infinite Alleles Model Neutral Mutations 

Recombination Migration Transposable Elements 


Genetics includes several processes that create new types of genetic
variation in populations or that allow for the reorganization of
previously existing variation either within genomes or among 
subpopulations. The ultimate source of genetic variation is mutation, by 
which we mean any heritable change in the genetic material. Mutation
therefore includes a change in the nucleotide sequence of a single gene as well as the
formation of a chromosome rearrangement, such as an inversion or a translo- 
cation. Recombination brings mutations of different genes together into the 
same chromosome. Migration enables mutations to spread among subpopu- 
lations. A transposable element is a DNA sequence able to replicate and 
insert into any of a large number of sites in the genome. By insertion in or 
near a gene, a transposable element can alter the level or pattern of gene 
expression; recombination between transposable elements can result in a 
chromosome rearrangement, for example, an inversion. In this chapter, we
consider the processes by which genetic variation is created. 


Mutation is the ultimate source of genetic variation for evolutionary change.
However, most wild type genes mutate at a very low rate, typically in the
range from $10^{-4}$ to $10^{-6}$ new mutations per gene per generation. Even a low
mutation rate can create many new mutant alleles because, in a large
population, each of a large number of genes is at risk of mutating. In a population
of size N diploid organisms, there are 2N copies of each gene, each of which
can mutate in any generation. Mutations are rare, but in a large population
there are many alleles at risk. For example, if the mutation rate (probability
of mutation) is $10^{-9}$ per nucleotide pair per generation, then in each human
gamete, the DNA of which contains $3 \times 10^9$ nucleotide pairs, there would be an
average of three new mutations in each generation; each newly fertilized egg 
would carry, on the average, six new mutations. The present-day human 
population of approximately 6 billion people would therefore be expected to 
carry approximately 36 billion new mutations that were not present even one 
generation earlier. 
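
The arithmetic in this example can be laid out explicitly. The sketch below takes the mutation rate as $10^{-9}$ per nucleotide pair and the gamete size as $3 \times 10^9$ nucleotide pairs, which are the values implied by the stated result of three new mutations per gamete; the function name is hypothetical.

```python
def expected_new_mutations(rate_per_site, sites_per_gamete, population_size):
    """Expected new mutations per gamete, per zygote, and in the population."""
    per_gamete = rate_per_site * sites_per_gamete
    per_zygote = 2 * per_gamete              # a zygote unites two gametes
    in_population = per_zygote * population_size
    return per_gamete, per_zygote, in_population


# Values assumed from the passage: 1e-9 per site, 3e9 sites, 6 billion people.
per_gamete, per_zygote, in_population = expected_new_mutations(1e-9, 3e9, 6e9)
print(per_gamete, per_zygote, in_population)
```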

Irreversible Mutation 

Although mutation may create a new allele, the initial frequency of the 
mutant allele must be very small if the population size is large. A single new 
mutant allele in a diploid population of size N has an initial frequency of 
1 /2N. New mutations in subsequent generations may augment the number 
of mutant alleles, but recurrent mutation alone increases the allele
frequency of the mutant very slowly. Consider an example in which A is the
wild-type allele and a the mutant form. If there is exactly one new mutation per
generation, then the allele frequency of a increases according to the series
1/2N, 2/2N, 3/2N, . . . and, if N is large (for example, $N = 10^6$), then the
increase is very slow indeed. Hence, the tendency for allele frequency to 
change as a result of recurrent mutation (mutation pressure) is very small. 
On the other hand, the cumulative effects of mutation over long periods of 
time can become appreciable. 

A useful model for thinking about mutation is the Hardy-Weinberg model
of Chapter 3, but with mutation permitted. For the moment, we focus on
mutations that have so little effect on the ability of the organism to survive and
reproduce that natural selection does not appreciably influence their
frequency. We will also assume that mutation is irreversible, which means that a cannot
reverse-mutate to A. To avoid complications resulting from change in allele
frequency due to chance, we will assume a population that is infinite in size.

Consider a gene with two alleles, A and a, and suppose that A mutates to
a at a rate of $\mu$ mutations per A allele per generation. In other words, each A
allele has a probability of $\mu$ of mutating to a in any generation. We will
symbolize the allele frequency of A as p and that of a as q and keep track of
generations with subscripts. Hence, $p_t$ and $q_t$ are the allele frequencies of A and a,
respectively, in the tth generation, where t = 0, 1, 2, . . . In any generation,
$p_t + q_t = 1$ because A and a are the only alleles considered.

Next we will deduce a formula for the allele frequency $p_t$ in terms of the
allele frequency $p_{t-1}$ in the previous generation. In generation t, $p_t$ includes all
the A alleles in generation t - 1 that did not mutate in that generation, and so

$p_t = p_{t-1} \times (1 - \mu)$

However, by the same reasoning, $p_{t-1}$ includes all A alleles in generation
t - 2 that did not mutate in that generation, and so $p_{t-1} = p_{t-2} \times (1 - \mu)$.
Substituting this equation into the one above yields

$p_t = p_{t-2} \times (1 - \mu)^2$

Continuing in the same manner leads eventually to

$p_t = p_0 \times (1 - \mu)^t$    (5.1)

The effect of mutation pressure on allele frequency is illustrated in Figure
5.1 for the case $\mu = 10^{-4}$. The allele frequency of A decreases very slowly,
almost linearly at first because the governing term in Equation 5.1, $(1 - \mu)^t$, is
approximated by $1 - \mu t$ when t is sufficiently small. After 1000 generations,
the allele frequency of A is still 0.90; however, at t = 10,000 generations,
$p_t = 0.37$; and at t = 20,000 generations, $p_t = 0.14$.

Figure 5.1 Change in frequency under mutation pressure. In this example, an
allele A mutates to a at a rate of $\mu = 1 \times 10^{-4}$ per generation; $p_t$ is the allele
frequency of A in generation t. We assume that $p_0 = 1$. With the given value of $\mu$,
the allele frequency decreases by half every 6931 generations. (Horizontal
axis: time t, in generations.)

One instructive way to analyze Equation 5.1 is to consider the time
required to reduce the allele frequency of A by half. To find the "half-life" of
the process, set $p_t = 0.5 \times p_0$; this relationship implies that $0.5 = (1 - \mu)^t$. Taking
logarithms of both sides, we obtain

$t_{1/2} = \ln(0.5)/\ln(1 - \mu) \approx 0.6931/\mu$

In the example in Figure 5.1, $t_{1/2}$ = 6931 generations. A decrease in $\mu$ by a
factor of 10 increases $t_{1/2}$ accordingly to approximately 69,310 generations
for $\mu = 10^{-5}$ and to approximately 693,100 generations for $\mu = 10^{-6}$. The fact
that mutation pressure is a weak force for changing allele frequency is
illustrated by the long half-lives calculated for realistic values of the mutation rate.
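
The figures quoted for Equation 5.1 and for the half-life are easy to reproduce. The following sketch assumes $p_0 = 1$ and the mutation rates used in the text; the function names are illustrative, and the linear term $1 - \mu t$ is printed alongside the exact value to show where the approximation stops being useful.

```python
import math


def allele_frequency(p0, mu, t):
    """Equation 5.1: frequency of A after t generations of irreversible mutation."""
    return p0 * (1 - mu) ** t


def half_life(mu):
    """Generations needed for the frequency of A to fall by half."""
    return math.log(0.5) / math.log(1 - mu)


for t in (1000, 10_000, 20_000):
    exact = allele_frequency(1.0, 1e-4, t)
    linear = 1.0 - 1e-4 * t      # small-t approximation; poor once mu * t nears 1
    print(t, round(exact, 2), round(linear, 2))

for mu in (1e-4, 1e-5, 1e-6):
    print(mu, round(half_life(mu)))
```

The exact half-lives differ slightly from the rounded multiples of 6931 given above because $0.6931/\mu$ is itself an approximation to $\ln(0.5)/\ln(1 - \mu)$.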
As noted with reference to Equation 5.1, the approximation $p_t \approx p_0(1 - \mu t)$
is quite accurate for small values of t. With respect to the allele frequency of
the mutant allele a, the approximation can also be written as $q_t \approx q_0 + \mu t$,
provided that $q_0$ is small. This approximation implies that the allele
frequency of the a allele increases linearly with time with a slope equal to $\mu$.
Because $\mu$ is small, however, the linear increase in $q_t$ is difficult to detect
experimentally except in very large populations. A large population size
can be attained in a bacterial chemostat, which is a device for maintaining
a population of bacteria in a continuous state of growth and cell division
(Figure 5.2). The linear increase in $q_t$ from mutation pressure observed in a
chemostat is shown in Figure 5.3. Note the abrupt increase in mutation rate
(indicated by the increase in slope) shortly after the addition of caffeine, a
bacterial mutagen.

Figure 5.2 Diagram of a bacterial chemostat. Nutrient medium drips in at the
top, but a constant volume is maintained by means of an overflow siphon. The
air coming in at the bottom provides oxygen. At the steady state, the rate of
inflow of nutrient equals the rate of outflow. Cells within the chemostat are in a
continuous state of division, but the population does not increase in size
because, in any interval of time, the number of new cells produced by division
is balanced by the number washed out through the siphon. (Labels in the
diagram: nutrient medium, air bubbles, bacterial growth, air input.)

Figure 5.3 Estimation of mutation rate in a bacterial chemostat. This
example concerns the rate of mutation of a gene in Escherichia coli that confers
resistance to infection by the bacteriophage T5. The frequency $q_t$ is the frequency of
T5-resistant cells after t generations of growth. The mutation rate is estimated
as the slope of the straight-line segments. Prior to the addition of caffeine,
the slope was $\mu = 7.2 \times 10^{-8}$ per generation. After addition of caffeine at a
concentration of 150 mg/l, the slope increased about tenfold to $\mu = 66 \times 10^{-8}$ per
generation. In this experiment, the generation time was 5.5 hours. (From
Novick 1955.) (Axes: frequency $q_t$ against time t, in generations, with the
point of caffeine addition marked.)

PROBLEM 5.1 A genetic factor has been described in Drosophila
mauritiana that results in the spontaneous deletion of the
transposable genetic element mariner at a frequency of approximately one
percent per generation for each copy (Bryan et al. 1987). In a
population containing an autosomal site at which a mariner insertion is
fixed (homozygous), how many generations would be required for
the frequency of flies that are homozygous for a deletion of the
element to exceed five percent? Assume that the population is
large, that mating is random, that the excision factor is fixed, and
that deletion of the element does not affect survival or reproduction.

ANSWER Let $p_t$ be the frequency of chromosomes in which the
mariner element remains undeleted in generation t, and let $\mu = 0.01$ be
the probability of deletion of the element per generation. For this
situation, Equation 5.1 applies with $\mu = 0.01$ and $p_0 = 1$. The frequency of
deletion homozygotes is greater than five percent when $(1 - p_t)^2 > 0.05$,
or $p_t < 1 - (0.05)^{1/2} = 0.776$. Thus, t should be greater than ln(0.776)/
ln(0.99) = 25.2 generations.
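
The answer can be verified with a short calculation. This sketch follows the same reasoning (Equation 5.1 with $p_0 = 1$, random mating so that deletion homozygotes have frequency $(1 - p_t)^2$); the function names are hypothetical.

```python
import math


def generations_to_threshold(mu, threshold):
    """Solve (1 - (1 - mu)^t)^2 = threshold for t, assuming p_0 = 1."""
    p_required = 1 - math.sqrt(threshold)
    return math.log(p_required) / math.log(1 - mu)


def deletion_homozygote_frequency(mu, t):
    """Frequency of deletion homozygotes in generation t under random mating."""
    p_t = (1 - mu) ** t          # chromosomes still carrying the element
    return (1 - p_t) ** 2


t_exact = generations_to_threshold(0.01, 0.05)
first_generation = math.ceil(t_exact)   # first whole generation past the threshold
print(round(t_exact, 1), first_generation)
print(deletion_homozygote_frequency(0.01, first_generation - 1),
      deletion_homozygote_frequency(0.01, first_generation))
```

Because generations are whole numbers, the frequency first exceeds five percent in the generation obtained by rounding the solution upward.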

Reversible Mutation 

In this section, in addition to forward mutation of A to a, we also allow
reverse mutation from a to A. In this case, the mutation pressure on the allele
frequency p is in both directions: forward mutation tends to decrease p,
reverse mutation tends to increase p. Eventually, an equilibrium is reached in
which the frequency p remains constant from generation to generation. At
this point, the loss of A alleles from forward mutation is exactly offset by the
gain of A alleles from reverse mutation.

To deduce the point of equilibrium, suppose that the rate of forward
mutation from A to a is $\mu$ per generation and that the rate of reverse mutation
from a to A is $\nu$ per generation. Let $p_t$ and $q_t$ denote the allele frequencies of A
and a in generation t, so that $p_t + q_t = 1$. An A allele in generation t can
originate in either of two ways. It could have been an A allele in generation t - 1
that escaped mutation to a (which happens with probability $1 - \mu$), or it could
have been an a allele in generation t - 1 that mutated to A (which happens
with probability $\nu$). In symbols,

$p_t = p_{t-1}(1 - \mu) + (1 - p_{t-1})\nu$    (5.2)


To solve equations of this type, a useful trick is to determine whether the
relation can be expressed in the form $p_t - A = (p_{t-1} - A)B$, where A and B are
constants dependent only on $\mu$ and $\nu$. Simplifying, we obtain $p_t = p_{t-1}B + A(1 - B)$.
Putting Equation 5.2 into the same form yields $p_t = p_{t-1}(1 - \mu - \nu) + \nu$. Equating
like terms, we deduce that $B = 1 - \mu - \nu$ and $A(1 - B) = \nu$. Consequently,
$A = \nu/(\mu + \nu)$. Hence, we can rewrite Equation 5.2 in the form

$p_t - \frac{\nu}{\mu + \nu} = \left(p_{t-1} - \frac{\nu}{\mu + \nu}\right)(1 - \mu - \nu)$    (5.3)


Because the relation between $p_{t-1}$ and $p_{t-2}$ is the same as that between $p_t$
and $p_{t-1}$, the solution to Equation 5.3 is

$p_t - \frac{\nu}{\mu + \nu} = \left(p_0 - \frac{\nu}{\mu + \nu}\right)(1 - \mu - \nu)^t$    (5.4)


To understand what happens to the allele frequency in the long run,
consider Equation 5.4 in the case when t is very large, for example $10^5$ or $10^6$
generations. Even though $1 - \mu - \nu$ is ordinarily close to 1, the value of t eventually
becomes so large that $(1 - \mu - \nu)^t$ becomes approximately 0. Thus, the whole
right-hand term in Equation 5.4 goes to 0, and so $p_t$ eventually attains a value
that remains the same generation after generation. Such a value of p is called an
equilibrium value, which we will denote by $\hat{p}$. In case of reversible mutation,
the equilibrium is found by equating the left-hand side of Equation 5.4 to 0:

$\hat{p} = \frac{\nu}{\mu + \nu}$    (5.5)


The manner in which $p_t$ converges to its equilibrium value is shown in
Figure 5.4 for the case $\mu = 10^{-4}$ and $\nu = 10^{-5}$. Note that, whatever the initial
frequency of A, the allele frequency eventually goes to $\hat{p}$, which in this
example equals 0.00001/(0.0001 + 0.00001) = 0.091. Figure 5.4 also indicates
that mutation pressure is usually very weak in changing allele frequency
inasmuch as the population requires thousands or tens of thousands of
generations to reach equilibrium.
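
As a check on the algebra, the recursion for reversible mutation can be iterated directly and compared with its closed-form solution. The sketch below uses the rates from the example ($\mu = 10^{-4}$, $\nu = 10^{-5}$); the function names and the choice of 30,000 generations are illustrative.

```python
def next_frequency(p, mu, nu):
    """One generation of the recursion (Equation 5.2)."""
    return p * (1 - mu) + (1 - p) * nu


def frequency_after(p0, mu, nu, t):
    """Closed-form solution after t generations (Equation 5.4)."""
    p_hat = nu / (mu + nu)
    return p_hat + (p0 - p_hat) * (1 - mu - nu) ** t


mu, nu = 1e-4, 1e-5
p_hat = nu / (mu + nu)               # equilibrium frequency (Equation 5.5)
print(round(p_hat, 3))

for p0 in (1.0, 0.0):
    p = p0
    for _ in range(30_000):          # iterate the recursion generation by generation
        p = next_frequency(p, mu, nu)
    print(p0, p, frequency_after(p0, mu, nu, 30_000))
```

Starting from either extreme, the iterated and closed-form values agree and both approach the equilibrium, although after 30,000 generations they are still visibly short of it.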


Figure 5.4 Theoretical change in allele frequency under pressure of reversible
mutation. The attainment of near-equilibrium values requires tens of thousands
of generations for realistic mutation rates. In this example, the forward
mutation rate (A -> a) is $\mu = 10^{-4}$ and the reverse mutation rate (a -> A) is $\nu = 10^{-5}$. The
equilibrium allele frequency of A, calculated from Equation 5.5, is 0.091.
(Horizontal axis: time t, in generations.)

PROBLEM 5.2 The bacterium Salmonella typhimurium has a genetic
switching mechanism that regulates the production of alternative
forms of a protein component of the cellular flagella. There are two
alleles, which we will call A (for the "specific-phase" flagellar
property) and a (for the "group-phase" flagellar property). Switching back
and forth between A and a takes place rapidly enough that Equation 5.4
can be applied. The transition from A to a has a rate of $\mu = 8.6 \times 10^{-4}$
per generation, and that of a to A has a rate of $\nu = 4.7 \times 10^{-3}$ per
generation. These rates are orders of magnitude larger than mutation rates
typically observed for other genes. The reason is that the change from
A to a and back again does not result from mutation in the
conventional sense but from intrachromosomal recombination (Simon et al.
1980). Formally, however, we can treat the system as one with
reversible mutation. In cultures initially established with the
frequency of A at $p_0 = 0$, Stocker (1949) found that the frequency increased to
p = 0.16 after 30 generations and to p = 0.85 after 700 generations. In
cultures initiated with $p_0 = 1$, the frequency decreased to 0.88 after 388
generations and to 0.86 after 700 generations. How do these values
agree with those calculated from Equation 5.4 using the estimated
mutation rates? What is the predicted equilibrium frequency of A?

ANSWER Note that $\nu/(\mu + \nu) = 0.845$. This is the predicted
equilibrium frequency (Equation 5.5). Also, $1 - \mu - \nu = 0.99444$, and this
quantity determines the rate of approach to equilibrium. For the cultures
with $p_0 = 0$, the predicted values are $p_{30} = 0.845 - (0.845)(0.99444)^{30} =
0.13$ and $p_{700} = 0.845 - (0.845)(0.99444)^{700} = 0.83$. For the cultures with
$p_0 = 1$, the predicted values are $p_{388} = 0.845 + (0.155)(0.99444)^{388} = 0.86$
and $p_{700} = 0.845 + (0.155)(0.99444)^{700} = 0.85$. The predicted values are
in very good agreement with the observations.
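
The predictions in the answer can be recomputed from Equation 5.4. The sketch below uses the rates stated in the problem and lists the observed frequencies as read from the problem text; the function and variable names are illustrative, and unrounded intermediate values are used, which does not change the results at two decimal places.

```python
def predicted_frequency(p0, mu, nu, t):
    """Equation 5.4: frequency of A after t generations of reversible switching."""
    p_hat = nu / (mu + nu)
    return p_hat + (p0 - p_hat) * (1 - mu - nu) ** t


MU = 8.6e-4   # rate of A -> a per generation
NU = 4.7e-3   # rate of a -> A per generation

# (initial frequency, generations, observed frequency) as given in the problem.
observations = [(0.0, 30, 0.16), (0.0, 700, 0.85), (1.0, 388, 0.88), (1.0, 700, 0.86)]

print("equilibrium:", round(NU / (MU + NU), 3))
for p0, t, observed in observations:
    print(p0, t, observed, round(predicted_frequency(p0, MU, NU, t), 2))
```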

Probability of Fixation of a New Neutral Mutation

The assumption of an infinite population size is not very realistic. In an 
improved model in which the population is finite, the change in frequency of 
a mutant allele depends not only on the mutation pressure but also on
random sampling from generation to generation. The sampling process, called
random genetic drift, results in chance changes in allele frequency. The process
is illustrated in Figure 5.5. The squares represent the 2N alleles in the adult
population in generation t. Each allele is assigned a unique label
($\alpha_1, \alpha_2, \alpha_3, \ldots, \alpha_{2N}$) to temporarily mask its identity as either A or a. The circles
represent the essentially infinite pool of gametes in generation t. In the gamete
pool, each labeled allele has a frequency of 1/2N. The squares at the bottom
represent two diploid genotypes in generation t + 1 formed by random
sampling from the pool of gametes. By chance, the two alleles forming a
genotype may be replicas of the same allele in the previous generation, for
example, $\alpha_i\alpha_i$. Alternatively, the two alleles forming a genotype may come
from different alleles in the previous generation, for example, $\alpha_i\alpha_j$.

Figure 5.5 Random sampling of alleles in a finite population increases the
probability of identity by descent (IBD). Two randomly chosen alleles,
illustrated in the squares at the bottom, may be IBD either because they are replicas of
the same allele in the immediately preceding generation ($\alpha_i\alpha_i$) or because they
are replicas of the same allele in a more remote generation ($\alpha_i\alpha_j$). (Labels in
the diagram: alleles in breeding population in generation t - 1; gametes, each
type with frequency 1/2N; generation t; probability 1/2N; 1 - 1/2N.)

The random sampling from the gamete pool means that some alleles may
be overrepresented in generation t + 1, relative to their frequency in
generation t, and some alleles may be underrepresented. Indeed, any particular
allele has a good chance of being unrepresented in generation t + 1, and
hence the lineage of that allele is terminated. To be precise, each allele in
generation t has a chance of approximately 1/e = 0.368 of not being represented
in generation t + 1. To understand why, consider the allele designated
$\alpha_1$. The frequency of $\alpha_1$ in the gamete pool is 1/2N, and the frequency of
all other alleles together is therefore 1 - 1/2N. Because the genotypes in
generation t + 1 are formed by the random selection of 2N alleles from the
pool of gametes, the distribution of the number of $\alpha_1$ and non-$\alpha_1$ alleles
present in generation t + 1 is given by successive terms in the binomial
expansion (Chapter 1):


$$\left[\frac{1}{2N}\,\alpha_1 + \left(1 - \frac{1}{2N}\right)\alpha\right]^{2N}$$


in which $\alpha$ represents the collection of all alleles other than $\alpha_1$. Hence, the
probability that $\alpha_1$ is not represented in generation t + 1 is

$$\left(1 - \frac{1}{2N}\right)^{2N} \approx \frac{1}{e} = 0.368 \tag{5.7}$$


The approximation is very good even when N is quite small. For example,
when N = 10, the left-hand side of Equation 5.7 equals 0.358, and, when
N = 20, the left-hand side equals 0.363.
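
As a quick numerical check of how fast $(1 - 1/2N)^{2N}$ approaches 1/e, the following minimal sketch evaluates the left-hand side for a few population sizes. The function name `prob_lineage_lost` and the larger values of N are illustrative choices, not from the text.

```python
import math


def prob_lineage_lost(n_diploid):
    """Probability that a given allele leaves no copy among the 2N alleles
    sampled for the next generation: (1 - 1/(2N))**(2N)."""
    two_n = 2 * n_diploid
    return (1.0 - 1.0 / two_n) ** two_n


for n in (10, 20, 100, 1000):
    print(f"N = {n:4d}: {prob_lineage_lost(n):.4f}   (1/e = {math.exp(-1):.4f})")
```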

The important implication of Equation 5.7 is that, owing to random genet- 
ic drift, the ancestral lineage of each allele faces a substantial risk of extinction 
in each generation. As time goes on, the lineages progressively disappear, one 
or a few at a time. Eventually, a time is reached at which all lineages except 
one have become extinct. At that time, every allele in the population is iden- 
tical by descent with a particular allele present in an ancestral population. 

The ultimate extinction of all but one lineage implies the answer to the
question: What is the probability that a single new mutation eventually
becomes fixed in a population of size 2N? The reasoning is illustrated in
Figure 5.6. Parts A and B show all the alleles present in the current
generation, immediately after a new mutation (shaded circle) has been created.
After a sufficient number of generations have passed, each of the alleles in
the descendant population will descend from a single allele, chosen at
random, in the current population. In part A, the descendant alleles all derive
from one of the nonmutants in the current population; the nonmutant alleles
have frequency 1 - 1/2N, and so this is the probability of ultimate fixation of
a nonmutant. In part B, the descendant alleles all derive from the mutant, and
so 1/2N is the probability of ultimate fixation of a new mutant allele. More
generally, for neutral alleles, which do not affect the survival or reproduction
of the organism, the probability of ultimate fixation in a finite population
is equal to the frequency of the allele in the initial population.

Figure 5.6 In a finite population, the lineages of all alleles must trace back to a
single allele in some ancestral population. Here, a particular allele of interest in
a diploid population of size N is indicated by the shaded circle. (A) The
probability the designated allele is not destined to be the common ancestor of all
alleles many generations in the future is 1 - 1/2N. (B) The probability the
designated allele is destined to be the common ancestor of all alleles many
generations in the future is 1/2N. Hence, the probability of ultimate fixation of a
newly arising neutral allele is 1/2N.

For the lucky few neutral alleles that are eventually fixed, the process
takes a long time: on the average, 4N generations. The method by which this
result can be deduced is considered in Chapter 7.
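
The fixation probability of 1/2N and the long fixation time can be illustrated with a small simulation. This is a minimal sketch, assuming the sampling scheme of Figure 5.5 (each of the 2N alleles in the next generation is drawn independently from the gamete pool); the population size, number of trials, random seed, and function name are all illustrative choices rather than anything specified in the text.

```python
import random


def simulate_neutral_fate(two_n, rng):
    """Follow one new neutral mutation (a single copy among 2N alleles)
    until it is either lost or fixed. Returns (fixed, generations)."""
    count, generations = 1, 0
    while 0 < count < two_n:
        freq = count / two_n
        # Draw each of the 2N alleles of the next generation from the gamete pool.
        count = sum(1 for _ in range(two_n) if rng.random() < freq)
        generations += 1
    return count == two_n, generations


rng = random.Random(1)   # fixed seed so that a run is repeatable
two_n = 20               # illustrative: N = 10 diploid individuals
trials = 5000
outcomes = [simulate_neutral_fate(two_n, rng) for _ in range(trials)]
fix_times = [gens for fixed, gens in outcomes if fixed]

fixed_fraction = len(fix_times) / trials
mean_fix_time = sum(fix_times) / len(fix_times)
print(f"fraction fixed = {fixed_fraction:.3f}  (1/2N = {1 / two_n:.3f})")
print(f"mean generations to fixation = {mean_fix_time:.1f}  (4N = {2 * two_n})")
```

With a population this small the 4N figure is only approximate, and both printed estimates carry sampling error because only a few hundred of the trials end in fixation.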

The Infinite-Alleles Model

Recall from Chapter 2 that many genes have more than two alleles
represented among the organisms in a natural population. It is therefore of some
importance to determine the expected level of genetic variation under
mutation pressure. A convenient measure of genetic variation is the
heterozygosity (the proportion of heterozygous genotypes). If a gene has a
greater heterozygosity than expected from mutation pressure alone, then
other forces that operate in nature must tend to preserve genetic variation.
On the other hand, if a gene has a smaller heterozygosity than expected, then
other forces must tend to eliminate genetic variation.

The heterozygosity of a gene is a function of the number of alleles and
their relative frequencies. In principle, the number of alleles of any gene
could be very large. For example, a gene coding for a protein of 300 amino
acids has a coding sequence 900 nucleotides in length. Because each
nucleotide site could be occupied by either an A, T, G, or C, the total number of
possible alleles is $4^{900}$, which equals about $10^{542}$. Hence, we can suppose that
every new mutation creates an allele that does not already exist in the
population. This is called the infinite-alleles model of mutation. The
infinite-alleles model is but one way to specify the characteristics of new mutations.
Although it represents a somewhat simplified view of mutation, it
nevertheless provides a useful standard of comparison for other models or for
observed allele frequencies.

In the infinite-alleles model, two alleles that are identical by state must
also be identical by descent because of the assumption that each mutation
creates a unique allele. Hence, in this model, homozygous genotypes must be
autozygous. To measure the homozygosity, therefore, we need to calculate
the autozygosity. This can be done with reference to the finite-population
model in Figure 5.5. As in Chapter 4, we let $F_t$ be the probability that, in
generation t, two alleles randomly chosen from a population are identical by
descent. In the context of Figure 5.5, the randomly chosen alleles are
combined in pairs to make genotypes, and so $F_t$ is also the probability of
autozygosity in generation t. We will use the $\alpha_i\alpha_i$ and $\alpha_i\alpha_j$ genotypes in
generation t in Figure 5.5 to derive an expression for $F_t$ in terms of
$F_{t-1}$, N, and the mutation rate u. First, consider the genotype $\alpha_i\alpha_i$.
What is the probability that this genotype has alleles that are identical by
descent? The alleles must be identical by descent provided that neither allele
has mutated in the course of one generation, and so the probability of identity
by descent in this case is $(1 - u)^2$. Now consider the genotype $\alpha_i\alpha_j$.
These alleles are identical by descent only if two randomly chosen alleles in
generation t - 1 are identical by descent, and if neither allele mutated in the
course of one generation, and so the probability of identity by descent in this
case is $F_{t-1}(1 - u)^2$. Because each of the labeled $\alpha$'s in Figure 5.5 has
the same frequency in the gamete pool (namely, 1/2N), the probability of a
combination like $\alpha_i\alpha_i$ is 1/2N and the probability of a combination like
$\alpha_i\alpha_j$ is 1 - 1/2N. Putting all this together, the recurrence equation for
$F_t$ is

$$F_t = \frac{1}{2N}(1 - u)^2 + \left(1 - \frac{1}{2N}\right)(1 - u)^2 F_{t-1} \tag{5.8}$$


Eventually an equilibrium value of F, call it $\hat{F}$, is attained in which the
increase in autozygosity from random genetic drift in any generation is
exactly offset by the decrease in autozygosity from new mutations. The
equilibrium can be found by equating $F_t = F_{t-1} = \hat{F}$ in Equation 5.8 and
solving. Ignoring terms in $u^2$ and those in u/N because they are expected to
be negligibly small, the solution is

$$\hat{F} = \frac{1}{1 + 4Nu} \tag{5.9}$$

to an excellent approximation. Therefore, the number of selectively neutral
alleles increases under mutation pressure until F satisfies Equation 5.9. Being
the equilibrium value of the probability of identity by descent, $\hat{F}$ is also the
equilibrium value of the autozygosity. Because of the assumption in the
infinite-alleles model that each allele in the population arises only once, all
genotypes that are homozygotes must also be autozygous. Therefore, $\hat{F}$ can
also be interpreted as the equilibrium value of the proportion of homozygous
genotypes.
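
The approach to equilibrium can be sketched numerically by iterating the recurrence that the text describes in words, $F_t = (1 - u)^2\,[\,1/2N + (1 - 1/2N)\,F_{t-1}\,]$, and comparing the limit with the approximation $1/(1 + 4Nu)$. This is an illustration only: the values of N and u, the convergence tolerance, and the function names are assumptions made for the example.

```python
def next_f(f_prev, n_diploid, u):
    """One generation of the recurrence described in the text:
    F_t = (1 - u)^2 * [1/(2N) + (1 - 1/(2N)) * F_{t-1}]."""
    drift = 1.0 / (2 * n_diploid)
    return (1.0 - u) ** 2 * (drift + (1.0 - drift) * f_prev)


def equilibrium_f(n_diploid, u, tol=1e-12, max_gens=10_000_000):
    """Iterate from F = 0 until successive values differ by less than tol."""
    f = 0.0
    for _ in range(max_gens):
        f_new = next_f(f, n_diploid, u)
        if abs(f_new - f) < tol:
            return f_new
        f = f_new
    return f


n, u = 1000, 1e-5  # illustrative values, not taken from the text
f_iter = equilibrium_f(n, u)
f_approx = 1.0 / (1.0 + 4 * n * u)
print(f"iterated F = {f_iter:.5f},  1/(1 + 4Nu) = {f_approx:.5f}")
```

The small remaining difference between the two numbers comes from the terms in $u^2$ and u/N that the approximation drops.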

It is an odd feature of Equation 5.9 that it gives the equilibrium
homozygosity of a population without explicit reference to allele frequencies. The
natural way to write the homozygosity expected with random mating for n
alleles with frequencies $p_1, p_2, p_3, \ldots, p_n$ is

$$\sum_i p_i^2 = p_1^2 + p_2^2 + \cdots + p_n^2 \tag{5.10}$$


We thus have two expressions for the equilibrium homozygosity in the
forms of Equations 5.9 and 5.10. Because the two equations refer to the same
thing, they must equal each other, and so $\sum p_i^2 = \hat{F} = 1/(4Nu + 1)$.
Alternative approaches leading to essentially the same result are discussed in
Sved and Latter (1977).

The homozygosity is the proportion of homozygous genotypes in a
population; the heterozygosity is the proportion of heterozygous genotypes.
Hence, homozygosity and heterozygosity are opposite sides of the same coin.
Therefore, if the homozygosity in a population is given by
$\hat{F} = 1/(4Nu + 1)$, then the heterozygosity is given by
$1 - \hat{F} = 4Nu/(4Nu + 1)$. These functions for the equilibrium homozygosity
and heterozygosity are plotted against 4Nu in Figure 5.7. The illustration
shows that there is a rather narrow range of 4Nu over which an intermediate
level of genetic variation (heterozygosity) is maintained. For example, the
equilibrium heterozygosity is in the range 0.2 to 0.8 only when 4Nu is in the
range 0.25 to 4.

Figure 5.7 Plot of average homozygosity and average heterozygosity for the
infinite-alleles model. Intermediate values of heterozygosity are maintained
over only a small range of 4Nu. (Horizontal axis: value of 4Nu.)

A complication in the interpretation of Equation 5.10 is that any number
of distributions of allele frequency can result in the same homozygosity. For
example, a population in HWE with the four alleles at frequencies $p_1 = 0.7$,
$p_2 = 0.1$, $p_3 = 0.1$, and $p_4 = 0.1$ has a homozygosity of $\sum p_i^2 = 0.52$;
likewise, a population in HWE with two alleles at frequencies $p_1 = 0.6$ and
$p_2 = 0.4$ also has a homozygosity of 0.52. The problem that many
distributions of allele frequency can result in the same homozygosity can be
sidestepped by assuming that all alleles are equally frequent. If the population
contains n equally frequent alleles, then
$p_1 = p_2 = p_3 = \cdots = p_n = 1/n$; the homozygosity is calculated from
Equation 5.10 as $\sum p_i^2 = n(1/n)^2 = 1/n$. At equilibrium, therefore,
$1/n = \hat{F} = 1/(4Nu + 1)$, or $n = 4Nu + 1$. The number n of equally frequent
alleles is called the effective number of alleles, often symbolized as $n_e$.
Diverse distributions of allele frequency can be compared in terms of their
effective number of alleles. Biologically speaking, $n_e$ is the number of equally
frequent alleles that would be required to produce the same homozygosity as
observed in an actual population. In the examples given at the beginning of
this paragraph, the four-allele population and the two-allele population with
identical homozygosities of 0.52 also have the same effective number of
alleles, namely $n_e = 1/0.52 = 1.92$.
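
The two examples in this paragraph can be reproduced with a few lines of Python. The sketch below simply evaluates $\sum p_i^2$ and its reciprocal; the function names are illustrative.

```python
def homozygosity(freqs):
    """Expected homozygosity under random mating: the sum of squared allele frequencies."""
    return sum(p * p for p in freqs)


def effective_number_of_alleles(freqs):
    """n_e = 1 / sum(p_i^2): the number of equally frequent alleles that
    would give the same homozygosity."""
    return 1.0 / homozygosity(freqs)


four_alleles = [0.7, 0.1, 0.1, 0.1]
two_alleles = [0.6, 0.4]
for label, freqs in [("four alleles", four_alleles), ("two alleles", two_alleles)]:
    print(f"{label}: homozygosity = {homozygosity(freqs):.2f}, "
          f"n_e = {effective_number_of_alleles(freqs):.2f}")
```

The same two functions apply to any list of allele frequencies that sums to 1, such as the columns of an allozyme table.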

PROBLEM 5.3 An allozyme study of a Caribbean population of
Drosophila willistoni (Ayala and Tracy 1974) yielded the following
estimated allele frequencies for the loci Adk-1 (adenylate kinase-1), Lap-5
(leucine amino peptidase-5), and Xdh (xanthine dehydrogenase).
            Adk-1    Lap-5    Xdh
Allele 1
Allele 2
Allele 3
Allele 4
Allele 5
Allele 6
Allele 7

Estimate the effective number of alleles of each gene. 

ANSWER The effective number of alleles is estimated as the reciprocal
of $\sum p_i^2$. For Adk-1, $n_e = 2.28$; for Lap-5, $n_e = 1.49$; and for Xdh,
$n_e = 2.68$. Note that the effective number of alleles is determined more
by the uniformity of allele frequencies than by the actual number of
alleles. For example, Lap-5 has more actual alleles than Adk-1 but a
smaller effective number of alleles.

Neutral Mutations 

The hypothesis that many genetic polymorphisms result from selectively 
neutral alleles maintained by a balance between the effects of mutation and 
random genetic drift is known as the neutral theory or the theory of selec- 
tive neutrality (Kimura 1968; King and Jukes 1969). Mutation introduces 
new alleles into a population, and random genetic drift determines whether 
a neutral allele will ultimately be fixed or lost. (Loss is the usual outcome.) 
At equilibrium, there is a balance between mutation and random genetic 
drift, so that, on the average, each new allele gained by mutation is balanced 
against an existing allele that is lost (or, more rarely, fixed). The balance point 
for the homozygosity in the infinite-alleles model is given in Equation 5.9.

In essence, the neutrality hypothesis states that many mutations have so
little effect on the organism that their influence on survival and
reproduction is negligible. The frequencies of neutral alleles are not, therefore,
determined by natural selection. Consequently, if the neutrality hypothesis is
true, then many polymorphisms may have no particular significance in the
adaptation of a species to its environment. From the perspective of
adaptation, selectively neutral polymorphisms are mere evolutionary "noise" and,
regardless of how much their study may reveal about population structure
and random genetic drift, they tell us little or nothing about adaptive genetic
changes in evolution. Kimura (1968) gave the irony a positive spin by noting
that "if my chief conclusion [about the prevalence of neutral alleles] is correct,
then we must recognize the great importance of random genetic drift ... in
forming the genetic structure of biological populations." Quite so. Indeed,
while neutral alleles are unsuitable for the study of genetic adaptation, the
very fact that they are invisible to natural selection makes them ideal for
mapping the geographical structure of populations and for tracing the
ancestral lineages of DNA sequences to make inferences about the phylogenetic
relationships between species.

Because the neutrality hypothesis is of fundamental importance in popu- 
lation genetics and evolution, it has been a subject of considerable discussion. 
The neutrality hypothesis was put forward in the late 1960s at a time when 
most of the genome was supposed to have a protein-coding function. Introns 
and other noncoding sequences were unknown. Today it is clear that only 
about 4 percent of the mammalian genome codes for proteins. The low coding
density affords ample scope for mutations that have little or no effect on
fitness, including some (but by no means all) mutations in introns, pseudo- 
genes, spacers between genes, noncoding DNA in the centromeric region of 
chromosomes, and so forth. 

There is still considerable controversy whether amino acid
polymorphisms are selectively neutral or nearly neutral. To assess the plausibility of
the neutrality hypothesis, many aspects of the model must be compared with
the situation in actual populations. One aspect of the hypothesis developed in
the preceding section concerns the homozygosity to be expected with the
infinite-alleles model. Using an observed allozyme homozygosity, we
can estimate the effective number of alleles $n_e$ and, from the expression
$n_e = 4Nu + 1$, estimate the corresponding value of Nu. If the resulting values
are grossly unreasonable, we can safely reject the infinite-alleles version of
the neutrality hypothesis (or at least argue that actual populations cannot be
in equilibrium).

Recall from Chapter 2 that observed values of heterozygosity of allozyme
genes range from 0.04 to 0.14 in most organisms (see Figure 2.9). Observed
homozygosities therefore range from 1 - 0.04 = 0.96 to 1 - 0.14 = 0.86, which
corresponds to estimated $n_e$ in the range 1/0.96 = 1.04 to 1/0.86 = 1.16.
Estimates of Nu, calculated as $(n_e - 1)/4$, therefore range from 0.01 to 0.04.
The fact that the maximum estimated value of Nu differs from the minimum by
a factor of only about four is surprising, inasmuch as the population number
in different species ranges over a factor of $10^4$ or more. The apparently too
uniform distribution of allozyme homozygosities among diverse organisms
has been interpreted as implying that the neutrality hypothesis is wrong for
amino acid polymorphisms. On the other hand, estimates of the population
number in natural populations are generally imprecise because the studies
are very difficult, and estimates of u, which in this case is the mutation rate to
neutral alleles, are even more uncertain.
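
The chain of arithmetic in this paragraph (heterozygosity to homozygosity to $n_e$ to Nu) is easy to script. The sketch below assumes, as the paragraph does, that populations are at equilibrium under the infinite-alleles model so that $n_e = 4Nu + 1$; the function name is illustrative.

```python
def nu_from_heterozygosity(h):
    """Estimate N*u from an observed heterozygosity h, assuming equilibrium
    under the infinite-alleles model: n_e = 1/(1 - h) and n_e = 4Nu + 1."""
    n_e = 1.0 / (1.0 - h)
    return (n_e - 1.0) / 4.0


low = nu_from_heterozygosity(0.04)
high = nu_from_heterozygosity(0.14)
print(f"Nu ranges from {low:.3f} to {high:.3f}, a factor of {high / low:.1f}")
```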

Figure 5.8A shows a second type of test of the adequacy of the neutrality
hypothesis [...] genes. The shaded histogram is the observed distribution of
heterozygosity [...]. The histogram outlined in solid lines is a
computer-generated theoretical distribution expected with the infinite-alleles model.
The observed heterozygosity is 0.099, and the theoretical heterozygosity is
0.091. The correspondence between the histograms is fairly good, but the
observed distribution seems to include too many genes with heterozygosities
in the range of 0.35 to 0.55. (For a possible explanation, see Fuerst et al. 1977.)

Figure 5.8 (A) Observed distribution of allozyme heterozygosity among
[...] (shaded) along with theoretical distribution for selective neutrality
(solid lines). Horizontal axis: heterozygosity. (B) Mean and variance of
heterozygosity among [...]. Horizontal axis: average heterozygosity. The solid
line is the theoretical expectation for the infinite-alleles model in which the
mutation rate to neutral alleles varies among genes in such a manner that the
variance in mutation rate equals the square of the mean mutation rate.
Symbols: mammals (33 species); birds (2 species, 1 subspecies); fish (18
species, 1 subspecies); lizards (21 species); amphibians (... species, 1
subspecies). (After Nei et al. 1976.)

A third type of test of the neutrality hypothesis is shown in Figure 5.8B,
which presents data on the mean and variance of heterozygosity in 77
vertebrate species. The curve is the theoretical expectation from the
infinite-alleles model when the rate of selectively neutral mutation varies among
genes (Nei et al. 1976). At first glance, the fit in Figure 5.8B is impressive. On
the other hand, the observed points are sufficiently scattered that any number
of other curves might fit at least as well. Evidently, statistical comparisons of
this sort are too lacking in power to distinguish between the hypotheses.

A brief consideration of the phrase lacking in power may be in order. The 
neutral theory is useful in being a sort of starting point, or null hypothesis, 
which provides predictions about the relationships among observed quanti- 
ties that can be confirmed or rejected. Statistical tests of the neutral theory are 
similar to other types of statistical tests in that two distinct types of possible 
errors must be balanced. If the tests are too demanding (for example, in fail- 
ing to allow for the effects of random sampling error), then data may often 
result in rejection of the hypothesis even when it is true. False rejection is 
called Type I error. On the other hand, if the statistical test allows too much 
latitude in the data, then data will seldom result in rejection of the hypothe- 
sis even when it is false. False acceptance is called Type II error. The tradeoff 
between Type I error and Type II error is that the probability of Type I error
cannot be decreased without increasing the probability of Type II error, and 
vice versa. By convention, statisticians usually adopt a 5 percent criterion for 
rejection of the null hypothesis even when it is true. This is the familiar 5% 
level of statistical significance, and it means that there is a 5% chance of 
rejecting a true hypothesis (Type I error). With this convention, the probability
of a Type II error (failing to reject a false hypothesis) falls where it may,
and a test with a relatively high probability of Type II error is said to be lack- 
ing in power. 
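
The tradeoff can be made concrete with a toy test that has nothing to do with the allozyme data. The sketch below, under purely hypothetical numbers, takes a one-sided binomial test of the null hypothesis p = 0.5 in 20 trials, picks the smallest rejection cutoff whose Type I error does not exceed 5%, and then computes the Type II error if the true value is p = 0.6. The function name and all parameter values are assumptions made for illustration.

```python
from math import comb


def upper_tail(n, p, c):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c, n + 1))


n, p_null, p_alt, level = 20, 0.5, 0.6, 0.05  # hypothetical values

# Smallest cutoff c such that rejecting when X >= c keeps Type I error <= level.
cutoff = next(c for c in range(n + 2) if upper_tail(n, p_null, c) <= level)
type_1 = upper_tail(n, p_null, cutoff)        # chance of rejecting a true null
type_2 = 1.0 - upper_tail(n, p_alt, cutoff)   # chance of not rejecting a false null
print(f"reject when X >= {cutoff}: Type I = {type_1:.3f}, Type II = {type_2:.3f}")
```

Holding the Type I error at or below 5% leaves a large Type II error in this example, which is what "lacking in power" means: the alternative is too close to the null, and the sample too small, for the test to tell them apart reliably.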

Although the comparisons in Figure 5.8 are lacking in power and hence 
are inconclusive in their support of the neutrality hypothesis, many other 
observations and types of data have been brought to bear in assessing the
hypothesis. These data often rely on comparison of nucleotide sequences of 
DNA in different genes or in different species. These types of comparisons 
and the conclusions from them are discussed further in Chapter 7. 


In the context of genetic variation, the importance of recombination is that it
allows linked alleles to become associated in many different combinations.
In a random mating diploid population, as discussed in Chapter 3, linked
alleles come into random association (linkage equilibrium) at a rate
determined by the frequency of recombination r (Equation 3.8). If r is small, it may
require many generations for linkage equilibrium to be attained. For
example, the average rate of recombination between adjacent nucleotides in
Drosophila is $2.7 \times 10^{-8}$, with wide variation in different parts of the genome,
and so nucleotide polymorphisms in the same region of the genome are
often in linkage disequilibrium. Consequently, the ultimate fate of a new
mutation may depend to a considerable extent on the effects of other
polymorphisms with which it is very closely linked. The effect of recombination
on the fate of genetic variation is the subject of this section.
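
To get a feeling for how slowly closely linked sites approach linkage equilibrium, the following sketch assumes that Equation 3.8 (not reproduced in this section) has the standard form $D_t = (1 - r)^t D_0$, where D is the linkage disequilibrium. Under that assumption, the number of generations needed to halve D follows directly; the function names and the two comparison values of r are illustrative.

```python
import math


def ld_remaining(r, t):
    """Fraction of the initial linkage disequilibrium left after t generations,
    assuming D_t = (1 - r)**t * D_0."""
    return (1.0 - r) ** t


def generations_to_halve(r):
    """Generations needed for D to fall to half its starting value."""
    return math.log(0.5) / math.log1p(-r)


r_adjacent = 2.7e-8  # recombination between adjacent nucleotides (value from the text)
for r in (0.5, 0.01, r_adjacent):
    print(f"r = {r:g}: D halves in about {generations_to_halve(r):,.0f} generations")
```

`math.log1p(-r)` is used instead of `math.log(1 - r)` so that the result stays accurate when r is as small as $10^{-8}$.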

Presumed Evolutionary Benefit of Recombination 

Evolutionary biologists have long taken it for granted that recombination is 
important in evolution because it accelerates the rate of formation of benefi- 
cial gene combinations. A graphical representation of the process is illustrat- 
ed in Figure 5.9. In part A are two large populations, one with no recombi- 
nation (an asexual species) and one with recombination (a sexual species). 
Each has three favorable mutations, a, b, and c, which ultimately become 
incorporated into the genome. In the asexual species, the mutations are 
incorporated sequentially because each favorable mutation must take place 
in the genetic background of the one before. The process is slow because 
each favorable mutation must be nearly fixed before there is a high chance 
that the next favorable mutation takes place in the proper genetic back- 
ground. In contrast, in the sexual population, there is no such problem. 
Recombination between the genes allows the triple mutant abc to be formed
almost immediately. 

The evolutionary advantage of recombination outlined in Figure 5.9A 
does not apply as strongly to the small populations in Figure 5.9B. In a small 
population, three favorable mutations are unlikely to be present simultane- 
ously, and so the fixation of the favorable alleles proceeds sequentially in a 
sexual as well as in an asexual species. 

Recombination and Polymorphism 

Because recombination between adjacent nucleotides is infrequent, nearby
nucleotide sites tend to evolve together. Owing to genetic linkage, forces that
tend to maintain genetic diversity or that tend to reduce genetic diversity
will act regionally. Therefore, the level of polymorphism found in any region
of the genome is expected to be correlated with the level of polymorphism in
a closely linked region. Evolutionary forces thus leave their mark on the level
and type of genetic variation found within closely linked regions of the
genome.

In D. melanogaster, an important pattern of genetic polymorphism
associated with degree of linkage is illustrated in Figure 5.10. A region of the
genome in which the rate of recombination per nucleotide is reduced, such as
near the tip or near the base of each chromosome arm, also tends to have a
reduced level of genetic polymorphism even though the rates of mutation are
uniform across the chromosome (Aquadro et al. 1994). In Figure 5.10, the
level of polymorphism is expressed as the proportion of nucleotide sites that
are polymorphic (called $\theta$ in Chapter 2). For the regions plotted, $\theta$ ranges
over more than a factor of 10, so there is clearly an important effect of close
linkage in reducing the level of polymorphism.

Figure 5.9 Evolutionary effect of recombination. (A) Large population. In a
large population of an asexual species with no recombination (top panel), the
favorable mutations a, b, and c must be incorporated into the genome
sequentially because there is no mechanism to bring the favorable mutations
together; each favored mutation must reach a high frequency to have a
reasonable chance that the next favorable mutation will take place in the proper
genetic background. With recombination (bottom panel), recombination
between the favorable genes enables the triple mutant abc to be formed very
rapidly. (B) Small population. The beneficial effect of recombination is
diminished in a very small population because, in a small population, multiple
favorable mutations are unlikely to be present simultaneously. (From Crow
and Kimura 1970.)

Figure 5.10 Observed relation between the level of nucleotide polymorphism
and the rate of recombination in Drosophila. (From Aquadro et al. 1994.)

In theory, the reduction in the level of polymorphism in regions of tight 
linkage could be explained by either of two diametrically opposed mecha- 
nisms. In one mechanism, the reduction results from the fixation of favor- 
able mutations. In the other mechanism, the reduction results from the 
elimination of harmful mutations. These explanations have somewhat differ- 
ent implications for the pattern of polymorphism in regions of tight linkage, 
and so they can be distinguished experimentally. 

Consider first the consequences of fixation of a favorable mutation. On its 
way to fixation, any new favorable mutation may carry along a small sur- 
rounding region of the genome and render the region monomorphic. The 
monomorphism will not usually be complete. Some degree of polymorphism 
may remain in the region, either because new mutations happen in the 
process of fixation or because of rare recombination events that take place 

The process in which a favorable mutation becomes fixed in a population is 
called a selective sweep. During a selective sweep of a favorable allele, any 
neutral alleles sufficiently tightly linked go along for the ride and are said to 
be hitchhiking. The main effect of hitchhiking is that a small region around 
the favored allele will be overrepresented in the population. In other words, 
there will be an apparent excess of rare genetic variants owing to the
overrepresentation of the region that profited from the hitchhiking.

Consider next the consequences of a harmful mutation. For concreteness,
consider the genetic map diagrammed in Figure 5.11A, in which the short
vertical lines indicate adjacent nucleotide sites. One site that can undergo
neutral mutation is embedded in the middle surrounded by sites that can
undergo harmful mutations only. The rate of harmful mutation per site per
generation is denoted u, and the rate of recombination between adjacent sites
is denoted r.

Figure 5.11 Effects of background selection on nucleotide polymorphism. (A)
A region of a chromosome containing a set of genes (tick marks) that can mutate
to detrimental alleles; within this set of genes is a single neutral site. The
mutation rate per locus is u and the rate of recombination between adjacent loci
is r; the totals across the region are $U = \sum u$ and $R = \sum r$.
(B) Relative nucleotide diversity as a function of U, the total mutation rate, and
R, the total recombination rate, across the chromosomal region. Horizontal
axis: recombination frequency across region (R). Note the positive correlation
between level of nucleotide polymorphism and rate of recombination.

Suppose further that each mutation, even when heterozygous, is sufficiently
harmful that any chromosome in which a mutation is present is ultimately
doomed. In the absence of recombination, the fate of a chromosome
depends on whether it is free of harmful mutations because, under our 
assumptions, no chromosome can persist for long unless it is free of muta- 
tions. The effect of harmful mutation, which in this context is called back- 
ground selection, is to reduce the number of chromosomes that can 
contribute to the ancestry of remote generations. Indeed, the effect of back- 
ground selection is identical to that of a reduction in population size except 
that the reduction applies, not to the genome as a whole, but to a tightly 
linked region (Charlesworth et al. 1993). Background selection therefore 
reduces the level of genetic polymorphism. Looser linkage means that a 
linked neutral mutation can escape the fate of a harmful neighboring muta- 
tion by recombination with a mutation-free chromosome. Hence, the tighter 
the linkage, the greater the reduction in polymorphism due to background 
selection. Although there is a reduction in the level of polymorphism, back- 
ground selection does not skew the distribution of rare polymorphisms 
because, for all practical purposes, the harmful allele merely causes one chro- 
mosome to drop out of the population, much as if it were to go extinct by 
chance (Braverman et al. 1995). 

Although the evidence is not yet conclusive, the model of background 
selection appears to provide a better explanation of the Drosophila data than
does the model of selective sweeps (Hudson and Kaplan 1995; Charlesworth 
et al. 1995). The evidence is that rare nucleotide polymorphisms are found at 
a frequency that would be expected given the overall level of polymorphism 
(Braverman et al. 1995). There is no evidence for a skewed distribution 
toward rare variants that the model of selective sweeps would predict. 

The effect of background selection on the level of genetic variation is
shown graphically in Figure 5.11B for the genetic map diagrammed in part A.
The curves are plotted from the formula

$$\pi = \pi_0 \, e^{-U/(2hs + R)} \tag{5.11}$$

(Hudson and Kaplan 1995). The symbol π is the nucleotide diversity, defined
as the average proportion of nucleotide differences between all possible pairs
of sequences (Chapter 2); π_0 is the value of π in the absence of background
selection. U and R refer to the diagram in part A. U is the total mutation rate
per diploid genome, summed across all genes in the region; and R is the total
rate of recombination across the region, summed over each of the intervals
between genes. The quantity hs measures the degree of harmfulness of each
deleterious mutation in a heterozygous genotype; the extremes are hs = 0,
when there is no effect in the heterozygote, and hs = 1, when the
heterozygote is lethal. The model on which Equation 5.11 is based includes the
assumption that hs is small but not too small.

The curves in Figure 5.11B are for the specific value hs = 0.02, which
means that a genotype that is heterozygous for one deleterious mutation has 
a 2% reduction in survival compared with a homozygous nonmutant. For 
each curve, the relative nucleotide diversity (π/π_0) decreases as the total
recombination rate R decreases. This result means that, with tighter linkage, 
each detrimental mutation that is eliminated takes with it a larger surround- 
ing region of chromosome. The relative nucleotide diversity also decreases as 
the total mutation rate increases; that is, greater background selection elimi- 
nates a greater number of chromosomes. Together, tight linkage and a mod- 
erate or high total mutation rate can result in a very substantial decrease in 
relative nucleotide diversity, reducing it to a level of 20% or less of that 
expected in the absence of background selection. In view of the reduction in 
genetic variation in regions of reduced recombination observed in Drosophila
(Figure 5.10), the implication of Equation 5.11, along with the absence of a 
skewed distribution toward rare variants, suggests that much of the effect 
results from background selection. 
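
The behavior of Equation 5.11 described above can be checked numerically. The following is a minimal sketch, assuming the formula π/π_0 = exp[-U/(2hs + R)] with hs = 0.02 as in the text; the function name and the grid of U and R values are hypothetical choices made for illustration and are not the values plotted in Figure 5.11B.

```python
import math

def relative_diversity(U, R, hs=0.02):
    """Return pi/pi_0 = exp(-U / (2*hs + R)), following Equation 5.11."""
    return math.exp(-U / (2.0 * hs + R))

# Hypothetical parameter grid, chosen only to show the two trends:
# diversity falls as R decreases and as U increases.
for U in (0.01, 0.05, 0.10):
    for R in (0.0, 0.02, 0.10):
        print(f"U={U:.2f}  R={R:.2f}  pi/pi_0={relative_diversity(U, R):.3f}")
```

With complete linkage (R = 0) and a total mutation rate of U = 0.10, the sketch gives a relative diversity well under 20 percent, in line with the statement in the text.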

Piecewise Recombination in Bacteria 

Many prokaryotic organisms make use of mechanisms of recombination in
which a piece of DNA that is small, relative to the size of the entire genome,
is transferred from a donor cell into a recipient cell. These mechanisms
include transformation, in which free DNA is taken up by the recipient from
the surrounding medium; transduction, in which a DNA fragment is carried
from the donor to the recipient by means of a virus particle; and conjugation,
in which a replica of the chromosome from a donor cell is transferred into a
recipient cell by a gradual process requiring cell-to-cell contact, but the
chromosome usually breaks before the transfer is complete. Because relatively
short patches of the genome participate in recombination, these processes
differ in their evolutionary implications from meiotic recombination in
eukaryotes.

The main effect of short-patch recombination is that long-range linkage
disequilibrium tends to be maintained. For example, in enteric bacteria, such
as Escherichia coli, which are part of the normal intestinal flora, linkage
disequilibrium between allozyme loci is very strong (Whittam et al. 1983). At
the level of DNA sequence, however, many genes have an obviously mosaic
structure in which different segments have different phylogenetic histories
(DuBose et al. 1988). An example from the phoA gene, coding for alkaline
phosphatase in E. coli, is illustrated in Figure 5.12. Among the polymorphic
nucleotide sites indicated, the unique nucleotide at each site is inscribed in a
box. At the extreme ends of the gene, the alleles from strains RM217T and
RM45E are the most closely related; in the middle of the gene, from nucleotide
sites 1425 to 1560, there is a run of polymorphic nucleotides in which the
similarity between RM217T and RM45E is lost, as if this part of the gene had
been introduced by recombination with a more distantly related allele.
Although short runs of similar or dissimilar nucleotides can also be the result
of chance, chance effects can be ruled out by appropriate statistical tests for
recombination (Stephens 1985; Sawyer 1989).

Figure 5.12 Evidence for recombination in the phoA gene in natural isolates of
E. coli. The pair of strains at the top are more similar at the beginning and end of
the gene; the pair of strains at the bottom are more similar in the central region.
There is significant clustering of the nucleotide sites inscribed in boxes, as
expected from recombination. (Data from DuBose et al. 1988.)

The finding that many genes have a mosaic ancestry through recombination
seems at first to contradict the finding of significant linkage disequilibrium
between more widely separated genes. The paradox is resolved by the
fact that each recombination event is local; it replaces a relatively short stretch 
of the recipient chromosome, and the linkage phase between more distant 
alleles is maintained. The E. coli chromosome, therefore, consists of clonal 
segments from a common ancestor, which is called the clonal frame (Milk- 
man and Bridges 1990, 1993), interrupted by short segments derived from 
recombination with diverse other clones. Even though the clonal frames are 
interrupted by relatively short recombinant segments, their integrity would 
ultimately be lost unless there were occasional selective events favoring
particular genotypes.

Absence of Recombination in Animal Mitochondrial DNA 

Studies in animal population genetics often focus on the DNA of mitochon- 
dria. The mitochondrial genome is informative about parentage because in 
most species of animals, it is maternally inherited and does not undergo 
recombination. It is also a small molecule present in abundant quantities in 
most cells. In animals, mitochondrial DNA (mtDNA) is a circular molecule 
typically in the range from 15 to 20 thousand base pairs in length. It codes
for fewer than 40 genes; approximately half code for ribosomal RNA or for
transfer RNA used in mitochondrial protein synthesis, and the remaining 
genes code for proteins used in electron transport or oxidative phosphoryla- 
tion. In many species, including mammals, parts of the mtDNA sequence 
evolve very rapidly in comparison with nuclear genes, and hence mtDNA 
can often be used to make inferences about population structure and recent 
population history.

An example of the utility of mtDNA in population studies is illustrated in
Figure 5.13, which summarizes the result of examining the mtDNA of 87
pocket gophers, Geomys pinetis, collected across the geographic range of the
species in Alabama, Georgia, and Florida (Avise et al. 1979). The mtDNA
from each gopher was digested in turn with each of six restriction enzymes,
each cleaving the DNA at a different six-base recognition site. The resulting
restriction fragments were separated by electrophoresis and compared
among the animals to estimate the number of nucleotide differences affecting
the restriction sites.

Figure 5.13 Lineage relationships between mtDNA types in pocket gophers.
The lowercase letters are different mtDNA types grouped according to
similarity and superimposed on a geographical map of the collection sites. The tick
marks across the connecting lines are the numbers of inferred mutational steps.
(From Avise 1994.)

Among the 87 gophers, there were 23 distinct types of mtDNA, represented
by the lowercase letters in Figure 5.13. Each of these types represents
a maternal mtDNA lineage, distinct from other lineages. Animals that share
an mtDNA type must have a female ancestor in common. The branching network
in Figure 5.13 estimates the matriarchal phylogeny of the mtDNA. The
straight lines connect related types of mtDNA, and the number of slashes 
across each line indicates the estimated number of nucleotide differences in 
the restriction sites between the mtDNA types. Groups of related mtDNA 
types are enclosed in thin black lines; the thickest lines delineate a western 
and an eastern subpopulation of gophers whose overall mtDNA sequence 
differs by an estimated 3%. Between the eastern and western subpopulations, 
there are 9 nucleotide differences among the sites cleaved by the restriction
enzymes.

The mtDNA network in Figure 5.13 also resolves population subdivision 
within the western and eastern subpopulations. This subdivision is indicated 
by the mtDNA types circumscribed by the thin black lines. Some of the 
mtDNA types such as "k" and "p" are widespread, whereas others such as 
"b" and "q" are more local in their distribution. The local clones usually dif- 
fer from the most widespread mtDNA type in the region by only one or two 
nucleotides among the sites cleaved by the restriction enzymes. The example 
in Figure 5.13 shows that, because of matrilineal inheritance and the absence 
of recombination in mtDNA, the network of mtDNA types can reveal a great 
deal about population substructure in natural populations. 


In a subdivided population, random genetic drift results in genetic diver- 
gence among subpopulations. Migration, which refers to the movement of 
organisms among subpopulations, is a sort of genetic glue that holds sub- 
populations together genetically and that sets a limit to how much genetic 
divergence can take place. To understand the homogenizing effects of
migration, it is useful to study migration in several simple models of
population structure.

One-Way Migration 

Figure 5.14 Model of one-way migration from a large land mass onto an
island. The allele frequencies in the source population, p* and q*, are assumed to
remain constant, whereas those in the recipient population, p_t and q_t, change
with time.

When migration takes place predominantly from one population into another,
without an equal amount of migration in the reverse direction, then there
is said to be one-way migration. An illustration of one-way migration
between a large mainland population and a small island subpopulation is
shown in Figure 5.14. For simplicity, we consider a gene with two alleles, A
and a, with respective frequencies p* and q* on the mainland and p and q on
the island. Suppose that, in any generation, a proportion m of zygotes in the
island subpopulation originates as a random sample of organisms from the
mainland. Then, if p and p' are the frequencies of A in the island
subpopulation in two successive generations, it follows that

$$p' = (1 - m)p + mp^* \tag{5.12}$$


In Equation 5.12, m is called the migration rate between the mainland
and the island. Subtracting p* from both sides of Equation 5.12 and simplifying
leads to the expression p' - p* = (1 - m)(p - p*); from this expression it
follows immediately that p_t - p* = (1 - m)^t (p_0 - p*), where p_t is the
frequency of A in the island subpopulation in generation t. Hence,

$$p_t = p^* + (1 - m)^t (p_0 - p^*) \tag{5.13}$$

Equation 5.13 expresses mathematically what should be clear intuitively:
With one-way migration, the allele frequency of A in the island subpopulation
gradually approaches that of the mainland population, and the rate of
approach is m per generation. As a check on Equation 5.13, note that, when
t = 0, then p_t = p_0, as must be the case, and as t becomes large, p_t -> p*.

As an evolutionary process that brings potentially new alleles into a
population, migration is qualitatively similar to mutation. The major difference is
quantitative: Generally speaking, the rate of migration among subpopulations
of a species is vastly greater than the rate of mutation of a gene. The
contrast is illustrated in Figure 5.15 for the unrealistic case in which the A
allele present in an island subpopulation is absent on the mainland. In this
case, Equation 5.13 becomes p_t = p_0(1 - m)^t, which has the same form as
Equation 5.1 for one-way mutation except that m replaces u. The identity in the
shape of the curves is apparent, but the time axis in Figure 5.15 is compressed
because, when m = 0.01, as in this example, compared with the value of
u = 0.0001 in Figure 5.1, it requires only one generation of migration to change
the allele frequency to the same extent as 100 generations of mutation.

Figure 5.15 Change of allele frequency with one-way migration assuming
that an allele A is initially fixed in the recipient population and absent in the
source population. The migration rate is m = 0.01. Note that this is the same
curve as in Figure 5.1 except that the horizontal axis is compressed to 500
generations. The time scale is different because, generally speaking, the migration
rate m is much larger than the mutation rate u.
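
The difference in time scale can be made concrete by asking how many generations are needed for the allele frequency to fall to a given fraction of its starting value under p_t = p_0(1 - rate)^t. The following sketch uses the two rates quoted in the text; the helper name and the choice of one-half as the target fraction are assumptions made for illustration.

```python
import math

def generations_to_reach(fraction, rate):
    """Generations t at which (1 - rate)**t equals `fraction`."""
    return math.log(fraction) / math.log(1.0 - rate)

t_migration = generations_to_reach(0.5, 0.01)    # m = 0.01, as in Figure 5.15
t_mutation = generations_to_reach(0.5, 0.0001)   # u = 0.0001, as in Figure 5.1
print(round(t_migration, 1), round(t_mutation, 1),
      round(t_mutation / t_migration, 1))
```

The ratio of the two times is close to 100, which matches the statement that one generation of migration does roughly the work of 100 generations of mutation at these rates.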

Equation 5.13 holds more generally for one-way migration by letting p be
the frequency of any allele in the population that receives the migrants and p*
be the frequency of the same allele in the population that supplies the
migrants. Application of this equation to estimating the amount of genetic
migration in certain human populations makes use of the allele-frequency
data given in Problem 4.4 (page 126). The data pertain to blacks and whites in
Claxton, Georgia, and blacks in West Africa. The case of the MN blood
groups serves as an example. In West Africa, which for the purpose of
this problem may be regarded as the ancestral black population,
p_0 = 0.474 for the allele frequency of M. In present-day Claxton blacks, p_t =
0.484. The Claxton white population may reasonably be regarded as
representative of the source of the migrants, and for Claxton whites, p* = 0.507.
Blacks came into the United States on a large scale from West Africa about
300 years ago, hence t is about 10 generations. Substituting these estimates
into Equation 5.13, we obtain 0.484 = 0.507 + (1 - m)^10 (0.474 - 0.507), from
which we infer that m = 0.035 per generation. This estimate can be
interpreted as implying that, in the genetic history of the population of Claxton
blacks, about 3.5% of the alleles of the MN gene in any generation were 
newly introduced by genetic migration from whites. The apparent amount of 
migration estimated by this method differs from one locus to the next. It also 
differs according to the geographical region in which the white and black 
populations reside. 
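
Solving Equation 5.13 for m gives m = 1 - [(p_t - p*)/(p_0 - p*)]^(1/t). The sketch below applies this rearrangement to the MN blood group values quoted in the text; the function name and the error handling are assumptions of this example.

```python
def estimate_migration_rate(p0, pt, p_star, t):
    """Solve Equation 5.13 for m, using (1 - m)^t = (pt - p*) / (p0 - p*)."""
    ratio = (pt - p_star) / (p0 - p_star)
    if ratio <= 0:
        # pt has reached or passed p*, so no real root in this model.
        raise ValueError("frequencies are inconsistent with one-way migration")
    return 1.0 - ratio ** (1.0 / t)

# MN blood group values quoted in the text.
m_hat = estimate_migration_rate(p0=0.474, pt=0.484, p_star=0.507, t=10)
print(round(m_hat, 3))
```

When the present-day frequency lies farther from p* than the ancestral frequency does, the ratio exceeds 1 and the estimate of m comes out negative.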

PROBLEM 5.4 Estimate the amount of migration from whites to
blacks using allele frequencies for each of the other genes in Problem
4.4 (page 126).

ANSWER Ss blood group: m = -0.013 per generation; Duffy: m =
0.011; Kidd: m = -0.028; Kell: m = -0.005; G6PD: m = 0.039;
hemoglobin β: m = 0.071.

Problem 5.4 illustrates some of the difficulties in estimating racial admix- 
ture from allele frequencies. The positive values of m vary widely, and the 
negative values are not consistent with the proposed model of migration. 
Cavalli-Sforza and Bodmer (1971) remark that "The weakness of the analysis
is mostly due to the uncertainty of the origin of black Americans ... and the 
variability of gene frequencies in the probable area of the slave markets in 
West Africa. In addition, it is unavoidable that gene frequencies have 
changed somewhat from their original values, due to drift or, in some cases, 
selection. The opportunities for admixture, and the time available for it, must 
also have varied widely." The most reliable gene among those in Problem 5.4 
is probably that for the Duffy blood groups because the Fy^a allele is virtually
nonexistent in all of West Africa. For this gene, the estimate of m is about one
percent per generation, a result that is consistent with the average value for a 
large number of other genes (Cavalli-Sforza and Bodmer 1971). 

The Island Model of Migration 

In the island model of migration, a large population is split into many
subpopulations dispersed geographically like islands in an archipelago.
Examples of island population structure might include fish in freshwater
lakes or slugs in dispersed garden plots. Each subpopulation is assumed to
be so large that random genetic drift can be neglected. Consider an allele A
with an average allele frequency among the subpopulations equal to p̄.
Migration is assumed to happen in such a way that the allele frequency
among the migrants equals the average allele frequency among the
subpopulations, namely, p̄. The amount of migration is again measured by the
parameter m, which equals the probability that a randomly chosen allele in any
subpopulation comes from a migrant. Let us consider a particular
subpopulation with an A allele frequency of p_t in generation t. For a randomly chosen
allele in this subpopulation in generation t, the allele could have come from
the same subpopulation in generation t - 1 with probability 1 - m, in which
case it is an A allele with probability p_{t-1}. Alternatively, the allele could have
come from a migrant in generation t - 1 with probability m, in which case it
is an A allele with probability p̄. Because all evolutionary processes other
than migration are ignored, p̄ stays the same in all generations. Altogether,

$$p_t = p_{t-1}(1 - m) + \bar{p}m \tag{5.14}$$


Equation 5.14 is similar to Equation 5.2 for mutation, and its solution in
terms of p_0 is

$$p_t = \bar{p} + (1 - m)^t (p_0 - \bar{p}) \tag{5.15}$$


The similarity with Equation 5.13 is apparent: in fact, the equations are
identical except that the role of p* in one-way migration is replaced with p̄ in
the island model. Perhaps less obvious is the similarity with Equation 5.4 for
reversible mutation, in which case v/(u + v) plays the role of p̄ and u + v
plays the role of m. The correspondence between the equations again
emphasizes the similarity between the effects of migration and those of mutation.
The processes result in similar mathematical expressions because both
mutation and migration act linearly on allele frequency, which means that p_t is a
linear function of p_{t-1}. Although Equation 5.15 for migration is
mathematically similar to Equation 5.4 for mutation, the biological implications are quite
different. Because rates of migration are typically much greater than rates of
mutation, changes in allele frequency are generally much faster with
migration.

As an example of the use of Equation 5.15, suppose there are only two
populations with initial allele frequencies of A of 0.2 and 0.8, respectively,
with m = 0.10. Thus 10 percent of the organisms in either subpopulation in
any generation are migrants having an allele frequency of A of p̄ =
(0.2 + 0.8)/2 = 0.5. What is the allele frequency of A in the two populations
after 10 generations? For the population with initial allele frequency 0.2, we
substitute p_0 = 0.2, p̄ = 0.5, and m = 0.10 into Equation 5.15 to obtain p_10 =
0.5 + (1 - 0.10)^10 (0.2 - 0.5) = 0.395; for the other population, we substitute
p_0 = 0.8, p̄ = 0.5, and m = 0.10, and so p_10 = 0.5 + (1 - 0.10)^10 (0.8 - 0.5) = 0.605.

Figure 5.16 Change of allele frequency with time in five subpopulations
exchanging migrants at the rate m = 0.1 per generation. Note the rapid
convergence to a common equilibrium frequency.

Another example using Equation 5.15 is shown in Figure 5.16, where there 
are five subpopulations (initial frequencies 1, 0.75, 0.50, 0.25, and 0), again 
with m = 0.10. Note how rapidly the allele frequencies converge to the same 
value, in this case, 0.5.
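
Both island-model examples can be reproduced by iterating Equation 5.14 in every subpopulation at once. The sketch below assumes equal-sized subpopulations, so that the migrant pool has the unweighted mean frequency; the function name and the choice of 40 generations for the five-subpopulation case are assumptions of this example.

```python
def island_model(initial_freqs, m, generations):
    """Iterate Equation 5.14 in every subpopulation.

    Assumes equal-sized subpopulations, so the migrant pool has the
    unweighted mean frequency p_bar, which stays constant over time.
    """
    p_bar = sum(initial_freqs) / len(initial_freqs)
    freqs = list(initial_freqs)
    for _ in range(generations):
        freqs = [p * (1.0 - m) + p_bar * m for p in freqs]
    return freqs

# Two-population example from the text (10 generations, m = 0.10).
two = island_model([0.2, 0.8], m=0.10, generations=10)
# Five subpopulations with the initial frequencies used in Figure 5.16.
five = island_model([1.0, 0.75, 0.50, 0.25, 0.0], m=0.10, generations=40)
print([round(p, 3) for p in two])
print([round(p, 3) for p in five])
```

The first line reproduces the values 0.395 and 0.605 obtained from Equation 5.15, and the second shows all five subpopulations within about 0.01 of the common value 0.5.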

How Migration Limits Genetic Divergence 

It is remarkable how little migration is required to prevent significant
genetic divergence among subpopulations as measured by, for example, the
fixation index F_ST. To understand the homogenizing effect of migration, consider
the model in Figure 5.5 (page 171), in which two alleles drawn at random
from a subpopulation in generation t are replicas of the same allele in
generation t - 1 with probability 1/2N and replicas of different alleles in generation
t - 1 with probability 1 - 1/2N. In the first case, the alleles are necessarily
identical by descent; in the second case, they are identical by descent with
probability F_{t-1}, where F is shorthand for F_ST. In either case, the identity by
descent is unbroken only if neither allele is replaced by an allele from a
migrant, and so

$$F_t = \frac{1}{2N}(1 - m)^2 + \left(1 - \frac{1}{2N}\right)(1 - m)^2 F_{t-1} \tag{5.16}$$


Illustrating again the analogy between migration and mutation, Equation
5.16 is identical to Equation 5.8 measuring the effect of mutation on the
probability of identity by descent, except that m replaces u. The equilibrium
value F̂ of F can be found by setting F̂ = F_t = F_{t-1}; after expanding the squared
terms on the right-hand side, and assuming that m is small enough, and N
large enough, that terms in m^2 and m/N can be ignored, some rearrangement
leads to

$$\hat{F} = \frac{1}{1 + 4Nm} \tag{5.17}$$
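
The rearrangement can be spelled out step by step. This is a sketch of the algebra under the stated approximations (terms in m^2 and m/N dropped), starting from the equilibrium form of Equation 5.16 in which both probabilities of identity are multiplied by (1 - m)^2; it is offered as a worked illustration and is not quoted from the source.

```latex
\begin{align*}
\hat{F} &= (1-m)^2\left[\frac{1}{2N} + \left(1-\frac{1}{2N}\right)\hat{F}\right] \\
        &\approx (1-2m)\left[\frac{1}{2N} + \left(1-\frac{1}{2N}\right)\hat{F}\right]
           && \text{drop the term in } m^2 \\
        &\approx \frac{1}{2N} + \left(1 - \frac{1}{2N} - 2m\right)\hat{F}
           && \text{drop the terms in } m/N \\
\hat{F}\left(\frac{1}{2N} + 2m\right) &\approx \frac{1}{2N}
\quad\Longrightarrow\quad
\hat{F} \approx \frac{1}{1+4Nm}
\end{align*}
```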


As might be expected, Equation 5.17 is identical in form to Equation 5.9
for mutation, but the biological implications are very different owing to the
fact that the rate of migration is typically much greater than the rate of
mutation.

The product Nm in Equation 5.17 has a straightforward biological
interpretation. The total number of alleles in a subpopulation of size N diploid
organisms is 2N. In any generation, the proportion of alleles that are replaced
by alleles from migrant organisms is m; hence the number of migrant alleles
in any generation equals 2Nm. However, 2Nm is also the total number of
alleles in Nm diploid organisms, and so Nm can be interpreted as the absolute
number of migrant organisms that come into each subpopulation in each
generation.

Because the absolute number of migrants per generation equals Nm, 
Equation 5.17 implies that F decreases as the number of migrants increases. 
Indeed, the decrease in F with increasing Nm is extremely rapid, as shown in 
Figure 5.17. In the extreme case of complete genetic isolation between the 
subpopulations, Nm = 0 and F = 1. The decrease is then so rapid that for:

• Nm = 0.25 (one migrant every fourth generation), F = 0.50

• Nm = 0.5 (one migrant every second generation), F = 0.33

• Nm = 1 (one migrant every generation), F = 0.20

• Nm = 2 (two migrants every generation), F = 0.11
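
The values in the list above follow directly from Equation 5.17. The sketch below recomputes them and, as a check on the approximation, compares each with the exact fixed point of the recursion in Equation 5.16 (taking both identity probabilities to be multiplied by (1 - m)^2). The subpopulation size N = 1000 and the function names are hypothetical choices for this illustration.

```python
def equilibrium_f_approx(n_m):
    """Equation 5.17: F = 1 / (1 + 4Nm)."""
    return 1.0 / (1.0 + 4.0 * n_m)

def equilibrium_f_exact(N, m):
    """Fixed point of Equation 5.16, keeping the m^2 and m/N terms."""
    a = (1.0 - m) ** 2
    return (a / (2.0 * N)) / (1.0 - a * (1.0 - 1.0 / (2.0 * N)))

N = 1000  # hypothetical subpopulation size, used only for the comparison
for n_m in (0.25, 0.5, 1.0, 2.0):
    print(n_m,
          round(equilibrium_f_approx(n_m), 2),
          round(equilibrium_f_exact(N, n_m / N), 4))
```

For a subpopulation of this size the exact and approximate values differ only in the third decimal place or beyond.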

The implication of Figure 5.17 is that migration is a potent force acting 
against genetic divergence among subpopulations. On the other hand, the 
homogenizing effect of migration should not be overestimated. The measure 
of genetic divergence in Figure 5.17 is F_ST, the value of which is determined
by the variance in allele frequency among subpopulations (Equation 4.6) and
so is affected primarily by polymorphic alleles that are at intermediate
frequencies. Rare alleles present in one subpopulation but absent in others have
hardly any effect on F_ST. Because rare alleles are rare, they are unlikely to be
included among migrant organisms unless the migration rate is very great,
and so rare alleles will tend to remain present in only one or a few
subpopulations in a local area until such time as their frequency may become great
enough to be dispersed by migration. An allele found in only one
subpopulation is called a private allele. Next we shall see that the rate of migration
can be estimated by an examination of the frequency of private alleles.

Figure 5.17 Decrease in the fixation index F_ST among subpopulations at
equilibrium in the island model of migration. The curve is that in Equation 5.17
giving F as a function of Nm. In the island model, Nm is the number of migrant
organisms that come into each subpopulation in each generation.

Estimates of Migration Rates 

One method of estimating genetic migration in natural populations relies on 
the finding that, in theoretical models, the logarithm of Nm decreases 
approximately as a linear function of the average frequency of private alleles 
in samples from the subpopulations (Slatkin 1985). Data on the average
frequency of private alleles have been compiled and analyzed by Slatkin (1985),
and the resulting estimates of Nm and equilibrium values of F_ST are
summarized in Table 5.1. There is obviously considerable variation in Nm among
organisms. However, many of the values of Nm are smaller than about 2, 
which means that there is still considerable opportunity for genetic diver- 
gence among subpopulations. 

A second kind of approach to estimating Nm in natural populations is
illustrated in Figure 5.18, which gives the distribution of estimated values of
F_ST among 61 genes in natural populations of Drosophila melanogaster (Singh
and Rhomberg 1987). The average of the estimated values is F_ST = 0.16,
which, assuming equilibrium, is an estimate of 1/(1 + 4Nm) (Equation 5.17). The
estimate is therefore Nm = [(1/0.16) - 1]/4 = 1.3. This estimate is within the
range for other Drosophila species in Table 5.1. However, there are many
genes in Figure 5.18 that have F_ST values greater than 0.30. An analogous
method of estimating Nm from the F_ST values of polymorphic nucleotides
within a gene is discussed in Hudson et al. (1994a). In Chapter 7 we will
consider how Nm can be estimated from the genealogies of genes.
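
The calculation in this paragraph is simply Equation 5.17 inverted to give Nm = (1/F_ST - 1)/4. A minimal sketch, with a hypothetical function name and assuming the population is at equilibrium under the island model:

```python
def nm_from_fst(f_st):
    """Invert Equation 5.17: Nm = (1/F_ST - 1) / 4."""
    if not 0.0 < f_st <= 1.0:
        raise ValueError("F_ST must lie in (0, 1]")
    return (1.0 / f_st - 1.0) / 4.0

# Average F_ST reported in the text for Drosophila melanogaster.
print(round(nm_from_fst(0.16), 1))
```

Because the relation is strongly nonlinear, genes with F_ST well above the average imply much smaller values of Nm; for example, F_ST = 0.30 corresponds to Nm of roughly 0.6.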

Patterns of Migration 

Migration in actual populations is more complex than is assumed in the
island model of migration. In nature, migrants come primarily from nearby
populations. To the extent that nearby populations have similar allele
frequencies, the effects of migration are smaller, and sometimes much smaller, than
predicted by the island model. Populations in nature may be strung out along one
dimension, such as a river bank. Populations may also be distributed regularly
in two dimensions, or there may be one large population with an internal
genetic structure caused by the tendency for mating to take place between
organisms born in the same region. Analysis of the effects of migration in such
complex population structures is usually very difficult. Among humans,
migration rates depend on age, sex, marital status, socioeconomic status, population
density, and many other factors. Migration rates also can change rapidly, and
so a full-blown theory of migration has to be extremely complex.

Table 5.1 Columns: species, type of organism, estimated Nm, estimated F_ST.
Species listed: Stephanomeria exigua, Drosophila willistoni, Drosophila
pseudoobscura, Chanos chanos, Hyla regilla, Plethodon ouachitae, Plethodon
cinereus, Plethodon dorsalis, Batrachoseps pacifica ssp. 1, Batrachoseps
pacifica ssp. 2, Batrachoseps campi, Lacerta melisellensis, Peromyscus
californicus, Thomomys bottae. Source: Data from Slatkin 1985.

Figure 5.18 Distribution of estimated values of F_ST for 61 genes among natural
populations of Drosophila melanogaster. Although the average value of F_ST
suggests migration at a level of Nm between 1 and 2, about one-third of the genes
have F_ST values greater than 0.20. (From Singh and Rhomberg 1987.)

The effects of migration on genetic differentiation of populations are seen
dramatically in Figure 5.19. Part A pertains to the moth Biston betularia, part B
to the moth Gonodontis bidentata. Both species have evolved melanic
(blackened) forms in response to heavy air pollution, and the graphs give the
frequency of the melanic forms in the two species. The geographical area in A
includes Liverpool and Manchester, as viewed from rural Wales. Note the
fall-off in frequency of melanics in the nonindustrial areas toward the front of
the graph. Biston betularia exists in low population densities and must fly
relatively long distances to find a mate. The resulting high rate of migration
hinders differentiation of populations, hence the smooth surface. In contrast,
Gonodontis bidentata exists in high population densities and the migration rate
is low; hence there is substantial genetic differentiation among populations,
as evidenced by the bumpy surface of the graph in part B.


A DNA sequence that can change its location within the genome is called a 
transposable element. In being able to create novel genome rearrangements, 
transposable elements are agents of genetic variation. A transposable ele- 
ment may insert into a coding region and inactivate a gene or insert into a 
regulatory region and change the pattern of expression of the gene. Also,
pairs of transposable elements may undergo recombination and create novel 
chromosome rearrangements. 

The process of transposition requires a protein, called transposase, which 
is usually encoded within the sequence of the transposable element itself. 
Most transposable elements undergo transposition through a replicative
process with DNA or RNA intermediates. In most cases, transposition to a
new location also leaves one copy of the transposable element behind in its
original location, so transposable elements can increase in copy number in
the genome. Some transposable elements are also able to regulate their own
rate of transposition. Several major classes of transposable elements can be 
distinguished by their nucleotide sequence organization or by the details of 
their mechanisms of transposition or regulation. 

Figure 5.19 (A) Distribution of melanic moths of the species Biston betularia over
an area including Liverpool and Manchester, as viewed from Wales. (B)
Distribution of melanic moths of the species Gonodontis bidentata over a smaller area
than in (A) but viewed from the same perspective. (From Bishop and Cook 1975.)

Factors Controlling the Population Dynamics 
of Transposable Elements 

Transposable elements were originally discovered in maize as the cause of 
certain genetically unstable mutations. They are now known to be ubiqui- 
tous among prokaryotes and eukaryotes (Berg and Howe 1989). The ability 
of transposable elements to increase in copy number and create novel chro- 
mosomal rearrangements reveals a dynamic aspect of genome structure and 
evolution not previously recognized. Some transposable elements have
become widely disseminated among organisms because of their ability to 
undergo horizontal transmission between reproductively isolated genomes. 
Often referred to as selfish DNA because transposition alone may be suffi- 
cient for persistence in the genome of a species, transposable elements also 
may occasionally create favorable mutations and thus become agents of 
adaptive evolution. 

Models for the population dynamics of transposable elements usually 
incorporate several features. 

• A rate of infection, in which genomes previously lacking the transposable 
element become infected with it. 

• A rate of transposition, which determines how rapidly the copy number 
increases; the effects of regulation are taken into account by assuming 
that the rate of transposition is a decreasing function of copy number. 

• A mechanism, or combination of mechanisms, for eliminating elements 
from the population; otherwise, the copy number would increase indefi- 
nitely. The usual assumption is that the presence of transposable ele- 
ments in the genome decreases the ability of an organism to survive and 
reproduce, resulting in the elimination of some elements by means of nat- 
ural selection, or that elements can be eliminated from the genome by 
means of genetic deletion. 

Through the study of such models, the diversity and novel attributes of 
transposable elements have been incorporated into the concepts of popula- 
tion genetics; see, for example, Langley et al. (1983), Montgomery and Lang- 
ley (1983), Kaplan and Brookfield (1983), Sawyer et al. (1987), Hartl and 
Sawyer (1988), Ajioka and Hartl (1989), Charlesworth et al. (1994). 

Insertion Sequences and Composite Transposons in Bacteria

Bacteria contain several types of transposable elements. Among the simplest 
are insertion sequences, which are typically about 1000-2000 nucleotides in 
length and contain at least one long translational open reading-frame coding 
for the transposase protein. The transposase recognizes a short nucleotide 
sequence, inverted in orientation, present at each end of the insertion 
sequence, and so the element moves as an intact unit. The bacterium
Escherichia coli contains several types of insertion sequences, each different
but all sharing the same sequence organization with inverted repeats and at
least one open reading frame. The factors controlling the population dynam- 
ics of insertion sequences can be deduced from the distribution of numbers 
of each element present among a sample of bacterial strains isolated from 
natural sources (Sawyer et al. 1987).

Population models of transposable elements in E. coli are greatly simpli-
fied because the organism has asexual reproduction, a low rate of recombina-
tion among strains, and a low rate of deletion of insertion sequences. The
"state" of a bacterial strain with respect to a particular insertion sequence 
may be defined as the number of copies n of the element that are present. 
Among the factors that control the population dynamics are: 

• The rate u at which uninfected cells become infected; u is the probability,
per generation, that a cell initially in state n = 0 ends up in the state n = 1.

• The rate T of transposition in infected strains; T is the probability, per
generation, that a cell in state n > 0 goes to state n + 1.

• The rate S at which reproduction of infected cells is less than that of unin-
fected cells. In terms of the exponential growth model in Chapter 1, if $r_0$ is
the intrinsic rate of increase of uninfected cells (see Equation 1.7 on page
30) and $r_n$ is that of infected cells, then $S = r_0 - r_n$.

The most general models of this type allow for T and S to be functions of 
n, but here we will assume that they are constant. Note, however, that the 
assumption that T is a constant implicitly defines a type of regulation 
because, if the probability of transition from state n to state n + 1 is indepen- 
dent of n, then the probability of transposition per element present in a strain 
must equal T/n and this fraction is a decreasing function of n.

Given constant values of u, T, and S, it can be shown that a popula-
tion of bacterial cells attains an equilibrium distribution of numbers of trans-
posable elements in which the probability $p_i$ that a cell contains exactly i
copies of the transposable element is equal to

$$p_0 = a, \qquad p_i = (1 - a)(1 - \phi)\,\phi^{\,i-1} \quad (i \ge 1) \tag{5.18}$$

where $a = 1 - (u/S)$ and $\phi = T/(T + S - u)$ (Sawyer and Hartl 1986, Sawyer et
al. 1987).
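The equilibrium distribution can be tabulated directly from the three rates. The following is a minimal sketch, assuming the form $p_0 = a$ and $p_i = (1 - a)(1 - \phi)\phi^{i-1}$ for $i \ge 1$, with $a = 1 - u/S$ and $\phi = T/(T + S - u)$. The function names and the numerical rates are hypothetical; the rates were picked only so that $a = \phi = 1/2$.

```python
from fractions import Fraction

def parameters_from_rates(u, T, S):
    """Return (a, phi) with a = 1 - u/S and phi = T/(T + S - u)."""
    return 1 - u / S, T / (T + S - u)

def equilibrium_distribution(a, phi, max_copies):
    """Return [p_0, ..., p_max_copies] for the assumed equilibrium form."""
    probs = [a]  # p_0: probability that a cell carries no copies
    for i in range(1, max_copies + 1):
        probs.append((1 - a) * (1 - phi) * phi ** (i - 1))
    return probs

# Hypothetical rates (not estimates from data), chosen so that a = phi = 1/2.
u, T, S = Fraction(1, 1000), Fraction(1, 1000), Fraction(2, 1000)
a, phi = parameters_from_rates(u, T, S)
probs = equilibrium_distribution(a, phi, 10)

for i, p in enumerate(probs[:5]):
    print(i, p)
```

Exact fractions are used so that the probabilities can be checked by hand; the mass not listed (cells with more than `max_copies` copies) is `1 - sum(probs)`.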

Equation 5.18 can be applied to the concrete case of insertion sequence
IS30 in E. coli, in which the distribution of numbers among 71 strains fits a
model with $a = 1/2$ and $\phi = 1/2$. With these parameters, the distribution simpli-
fies to the remarkably simple formula $p_i = (1/2)^{i+1}$ for $i \ge 0$. Among 71 strains,
therefore, the observed and expected numbers of strains containing i ele-
ments are as indicated in Table 5.2. The strains with five or more elements
have been grouped in order to carry out a $\chi^2$ test of goodness of fit. This $\chi^2$
test has three degrees of freedom because a and $\phi$ were estimated from the
data. The value of $\chi^2$ equals 3.48, which has an associated probability level of
about 0.35. Thus, the simple model for IS30 fits the observed data very well.
Although the $\chi^2$ test cannot be completely trusted in this case because of the
small expected numbers in some of the categories, the conclusion is support-
ed by a more exact statistical test (Sawyer et al. 1987). The following problem
deals with the distribution of three other insertion sequences in E. coli.

Table 5.2 (values not recoverable). Columns: number of copies of IS30 element; expected number of strains; observed number of strains. Source: Data from Sawyer et al. 1987.
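The expected column of such a table, and the tail probability of the test statistic, can be reproduced with a few lines of code. This is a sketch under stated assumptions: the distribution is taken to be $p_0 = a$, $p_i = (1 - a)(1 - \phi)\phi^{i-1}$ with $a = \phi = 1/2$; the function names are hypothetical; and the observed counts are not included because they are not given in the text. The tail probability uses the closed form for a chi-square variate with three degrees of freedom.

```python
import math

def expected_counts(a, phi, n_strains, max_class):
    """Expected strains with 0..max_class-1 copies, plus a pooled '>= max_class' class."""
    probs = [a] + [(1 - a) * (1 - phi) * phi ** (i - 1) for i in range(1, max_class)]
    probs.append(1 - sum(probs))  # pooled upper tail
    return [n_strains * p for p in probs]

def chi_square_sf_3df(x):
    """Upper-tail probability P(X >= x) for chi-square with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

expected = expected_counts(0.5, 0.5, 71, 5)   # classes 0, 1, 2, 3, 4, and 5 or more
p_value = chi_square_sf_3df(3.48)             # statistic quoted in the text

print(expected)
print(round(p_value, 2))
```

Six classes with two estimated parameters leave 6 - 1 - 2 = 3 degrees of freedom, which is why the three-degree-of-freedom tail is the relevant one here.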

PROBLEM 5.5 The distribution of IS1 fits Equation 5.18 with
$a = 1/5$ and $\phi$ = %; IS2 fits the equation with $a = 2/5$ and $\phi = 2/3$; and IS4
fits with $a = 2/3$ and $\phi$ = %. Calculate the expected numbers for 71
strains and carry out a $\chi^2$ test. (The observed numbers are from
Sawyer et al. 1987.)

(Table: observed numbers of strains by No. copies; the values are not recoverable.)

ANSWER For IS1, the expected distribution is given by $p_0 = 1/5$,
$p_i = (1 - a)(1 - \phi)\phi^{i-1}$ for $1 \le i \le 4$, and $p_{\ge 5} = 1 - (p_0 + p_1 + p_2 + p_3 + p_4)$. For
IS2, the expected distribution is $p_0 = 2/5$ and $p_i = (3/10)(2/3)^i$ $(1 \le i \le 4)$.
For IS4, the expected distribution is $p_0 = 2/3$ and $p_i = (1 - a)(1 - \phi)\phi^{i-1}$
$(1 \le i \le 4)$. Expected numbers, $\chi^2$ values, and associated probabilities:

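(Table: expected numbers of strains by No. copies for each insertion sequence, with $\chi^2$ and P value; the values are not recoverable.)
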
As in the case of IS30, more exact statistical tests confirm the con- 
clusion that the model fits. However, the distribution of IS1 has a very 
long tail, with nine strains containing from 15 to 20 copies and six
strains containing from 21 to 30 copies; this distribution is approxi-
mated even more closely by a model in which the regulation of trans-
position decreases more gradually than T/n (Sawyer et al. 1987).

Apart from their own evolutionary dynamics, insertion sequences are 
important because they can mobilize other sequences in the genome. When 
two copies of an insertion sequence are on flanking sides of an unrelated 
sequence, the inverted repeats used in transposition are preferentially those
at the extreme ends. This kind of insertion-sequence sandwich constitutes a 
composite transposable element or transposon, which transposes as a single 
unit. In a composite transposon, the central sequence can include one or more 
genes that confer a selective advantage on the host cell, such as a gene for 
resistance to an antibiotic; hence, the possession of the transposon would be 
favored in an environment containing the antibiotic.

Mobilization of genes for antibiotic resistance, heavy-metal resistance, and 
other functions is one of the principal evolutionary implications of transpos- 
able elements in bacteria. Transposable elements enable the piecewise assem- 
bly of specialized, infectious molecules called plasmids. Plasmids are 
autonomously replicating, circular molecules of DNA that exist within bacte- 
rial cells. Many plasmids contain genes that promote their transfer between 
different organisms. They may also contain genes, such as those for antibiotic 
resistance, that are highly advantageous to their hosts in certain environ-
ments. These genes are often contained in transposons, and they undoubted-
ly entered the plasmid through transposition from a different plasmid or from
the genome of a previous host. Infectious plasmids containing multiple antibi-
otic-resistance genes are called resistance transfer factors, and they are a
major source of multiple drug resistance in pathogenic bacteria.


Transposable Elements in Eukaryotes

Transposable elements can have important genetic consequences as muta- 
genic agents by the creation of novel genes, by alteration of the expression of 
genes in their vicinity, and in the genesis of major genomic rearrangements. 
Transposable elements also have important implications in population 
genetics and evolution. Several major classes of transposable elements have
been identified that differ in the molecular mechanisms of transposition. 
Within each class, the members can also differ in DNA sequence. Based on 
similarity in DNA sequence, transposable elements typically can be grouped 
hierarchically into "subfamilies," in which the elements resemble each other 
quite closely; "families," in which they differ from one another somewhat 
more; and "superfamilies," in which the differences are relatively great. 
Transposable elements are widespread in both animals and plants. For 
example, Drosophila melanogaster contains multiple copies of each of 50 to 100 
different families of transposable elements (Rubin 1983). Although few of 
these elements have been studied in detail from the standpoint of population 
genetics, indirect evidence suggests that most of the elements, like insertion 
sequences in bacteria, are mildly harmful to the host (Golding et al. 1986, 
Lohe et al. 1995).

Horizontal Transmission of Transposable Elements 

Among the most widespread families of transposable elements is that of the 
mariner-like elements (MLEs), typified by the transposable element mariner.
The molecular organization of the mariner element is illustrated in Figure
5.20A. The element is flanked by short (28 base pair) inverted repeats (IR)
and includes a long open reading frame coding for the transposase protein
(Hartl 1989). Insertion of the element is invariably adjacent to a 5'-TA-3' din-
ucleotide in the host genome and is accompanied by a duplication of the din-
ucleotide, so that the inserted mariner is flanked by 5'-TA-3'. The target
sequence and dinucleotide, as well as features of the amino acid sequence of 
the transposase protein, identify a transposable element as an MLE. 

MLEs are widely distributed among insects and other invertebrates 
(Robertson 1993; Robertson and MacLeod 1993). Figure 5.20B shows the dis- 
tribution among species in the major insect orders (Coleoptera, Diptera, and 
so forth). The number of copies of an MLE per genome varies widely among 
species, ranging from a few copies to many thousands. The MLEs in Figure 
5.20B have been grouped according to similarity in nucleotide sequence and 
arranged in the form of a tree with the root to the left and the tips of the 
branches to the right. There are several subfamilies of insect MLEs, denoted 
mauritiana, cecropia, honeybee, and so forth. MLEs in different subfamilies 
are typically 40 to 50% identical in nucleotide sequence, and those within the 
same subfamily are usually 60% or more identical. All of the insect MLEs are 
more closely related to each other than they are to an MLE found in the soil 
nematode Caenorhabditis elegans.

(A) IR, transposase-coding region, IR

(B) Insect orders: 2 Diptera, 3 Hemiptera, 4 Hymenoptera, 5 Lepidoptera, 6 Thysanura; C. elegans

Figure 5.20 (A) The molecular organization of the transposable element 
mariner showing the inverted repeats flanking the transposase-coding region. 
(B) Distribution of MLEs among species representing major insect orders (num-
bered). Note that the MLEs can be grouped into subfamilies of elements (mauri-
tiana, cecropia, and so forth) based on their similarity in sequence. C. elegans is
the soil nematode Caenorhabditis elegans. (Data for B from Robertson 1993.)

Although MLEs are widespread, their distribution is "spotty," which 
means that, among closely related species, a particular type of MLE may be
found in some species but not in others. Furthermore: 

• Any species may contain MLEs from two or more different subfamilies. 

• Closely related MLEs are often found in distantly related species. 

An example of the second principle is an MLE found in Drosophila erecta,
a close relative of D. melanogaster, which is 97% identical in nucleotide
sequence with an MLE found in the cat flea Ctenocephalides felis (Lohe et al.
1995). For comparison, a gene coding for a subunit of the cellular sodium
pump sequenced in both species shows only 39% nucleotide identity at third
codon positions.


What process can account for the virtual identity between MLEs in 
species as distantly related as a Drosophila and a cat flea? One possibility is 
that the MLE was present in the common ancestor of the species a few hun- 
dred million years ago and then virtually stopped evolving, so that the 
sequences remain almost identical today. Unless the nucleotide sequence is
very highly constrained, including third codon positions, this is a very 
unlikely possibility. Furthermore, if MLE sequences are so constrained, then 
why is there so much sequence variability within and among subfamilies? 
More likely than evolution stopping dead in its tracks for several hundred 
million years is the hypothesis of horizontal transmission, or the ability of 
an MLE to be transferred from a host species into the germline of a different, 
reproductively isolated species. To account for the D. erecta-C. felis case by
horizontal transmission, an MLE would have to have been transmitted from
a D. erecta ancestor to a C. felis ancestor (or the other way around) approxi-
mately 3 to 10 million years ago. Many additional examples of horizontal
transmission of MLEs and other eukaryotic transposable elements have been 
discovered. Although the process of horizontal transmission certainly takes 
place, the rate at which it happens and the vectors and mechanisms are as yet
unknown.

Once introduced into a genome, MLEs can persist through multiple spe- 
ciation events (Maruyama and Hartl 1991). A lineage can, however, lose an 
MLE, as evidenced by D. melanogaster, which has lost an MLE (the mariner
element itself) present in all its closest relatives. Two processes appear to con-
tribute to loss of an MLE: (1) mutational inactivation, which may destroy the
protein-coding function of an MLE or impair its ability to transpose; and (2) 
stochastic loss, by which we mean the elimination of an MLE from the 
genome as a result of random genetic drift. There might possibly also be a 
contribution from natural selection, depending on the extent to which pres- 
ence of the MLE itself is deleterious. From the standpoint of the host species, 
an inactivating mutation in an MLE may be selectively neutral, or perhaps 
even favorable, inasmuch as natural selection may act to minimize the harm- 
ful mutagenic effects of transposition. Subsequent mutations in an already 
inactivated MLE are presumably selectively neutral and ultimately lost by 
chance. The role of mutational inactivation and stochastic loss in the evolu- 
tionary dynamics of MLEs is supported by the spotty distribution of MLEs 
among closely related species. 


Mutation provides the raw material for evolutionary change but, by itself, 
mutation pressure is a very weak force for changing allele frequency. If allele 
A mutates to allele a at a rate u per generation, and a undergoes reverse
mutation at a rate v per generation, then the equilibrium frequency of A is 
v/(u + v), but the population may require tens of thousands or hundreds of 
thousands of generations to reach equilibrium. In the infinite-alleles model, 
the equilibrium value of $F_{ST}$ for neutral alleles is given by $1/(4N\mu + 1)$, where
$\mu$ is the mutation rate to selectively neutral alleles; $4N\mu + 1$ is called the effec-
tive number of alleles. For a neutral allele, the probability of ultimate fixation
equals the frequency of the allele in the population. Statistical tests of the
neutrality hypothesis based on the effective number of allozyme alleles or on
the allozyme heterozygosity are inconclusive owing to lack of statistical
power.
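The weakness of mutation pressure can be made concrete with a short calculation. The sketch below assumes the standard recursion for reversible mutation, $p_{t+1} = p_t(1 - u) + (1 - p_t)v$, whose solution is $p_t = \hat{p} + (p_0 - \hat{p})(1 - u - v)^t$ with $\hat{p} = v/(u + v)$. The function names and the rates $u = 10^{-5}$ and $v = 10^{-6}$ are illustrative assumptions, not values taken from the text.

```python
import math

def allele_frequency(p0, u, v, t):
    """Frequency of A after t generations; u is the A -> a rate, v the a -> A rate."""
    p_hat = v / (u + v)  # equilibrium frequency of A
    return p_hat + (p0 - p_hat) * (1 - u - v) ** t

def half_time(u, v):
    """Generations needed to close half of the gap between p_0 and equilibrium."""
    return math.log(0.5) / math.log(1 - u - v)

u, v = 1e-5, 1e-6  # hypothetical mutation rates, for illustration only
t_half = half_time(u, v)
print(round(t_half), allele_frequency(1.0, u, v, t_half))
```

With rates of this size the half-time runs to tens of thousands of generations, in line with the statement that equilibrium is approached very slowly.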

Recombination allows the formation of beneficial combinations of genes. 
In Drosophila, there is a positive correlation between the rate of recombina-
tion and the level of nucleotide polymorphism: regions of reduced recombi-
nation are less polymorphic. The reduced polymorphism could result from
selective sweeps of favorable mutations or from background selection 
against detrimental mutations. In prokaryotes, there is extensive linkage dis- 
equilibrium over long genetic distances in spite of the fact that each gene 
may have a mosaic ancestry owing to intragenic recombination. The appar-
ent paradox results because recombination in prokaryotes usually involves a
short stretch of DNA and the process is infrequent. In animal mitochondrial 
DNA, the absence of recombination enables the identification of mitochon-
drial lineages.

Migration hinders genetic divergence among subpopulations. In finite
populations, the equilibrium value of $F_{ST}$ with migration is given by
$1/(4Nm + 1)$, and only a few migrants per generation are sufficient to keep
$F_{ST}$ smaller than about 10%. On the other hand, a small amount of migration
is usually not sufficient to disperse rare alleles among subpopulations, and so 
rare alleles are often unique to one or a few subpopulations. 
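The claim that a few migrants suffice follows from inverting $1/(4Nm + 1)$. The sketch below is illustrative only; the function names and the population size used in the example are assumptions.

```python
def equilibrium_f(n_e, rate):
    """Equilibrium value 1 / (4 N x + 1), where x is the migration rate m
    (or, in the infinite-alleles model, the neutral mutation rate)."""
    return 1.0 / (4.0 * n_e * rate + 1.0)

def migrants_needed(f_max):
    """Threshold value of Nm above which the equilibrium falls below f_max."""
    return (1.0 / f_max - 1.0) / 4.0

threshold = migrants_needed(0.10)          # migrants per generation, Nm
check = equilibrium_f(1000, threshold / 1000)  # hypothetical N = 1000
print(threshold, check)
```

Because only the product Nm enters the formula, the threshold number of migrants per generation is the same for a population of any size.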

Transposable elements are ubiquitous in the genomes of all organisms. 
Their tendency to increase in copy number through their ability to repli- 
cate and transpose is usually offset by the harmful effects of the insertions 
themselves; hence, there is an equilibrium distribution of copy number 
among organisms. Some transposable elements have direct or indirect ben- 
eficial effects; bacterial transposons that carry genes for antibiotic resis- 
tance provide an example. Bacterial transposons are disseminated among 
organisms and among species by transmission of infectious plasmids in 
which the transposons may reside. In eukaryotes, horizontal transmission 
can take place between species in spite of absolute reproductive isolation. 
Many transposable elements can be grouped into subfamilies, families, 
and superfamilies based on their degree of nucleotide sequence similarity. 
The mariner-like elements (MLEs) are exceptionally widespread among
insects and other invertebrates. The innate tendency of an MLE to increase
in copy number in a genome is offset by mutational inactivation and ulti-
mately stochastic loss. These offsetting processes may explain the spotty
distribution of MLEs observed among closely related species.


1. Most protein-coding genes have a forward mutation rate (normal to
mutant) that is at least an order of magnitude greater than the reverse 
mutation rate (mutant back to normal). Why should this be the case? 

2. A classical bacterial experiment demonstrated that mutations occur at 
random and not in response to specific selection pressures for them. The 
experiment used sterilized velvet to imprint the geometrical pattern of 
bacterial colonies on an agar surface in a petri dish (a "plate"), which was
used to replicate the pattern by impressing the velvet on sterile nutrient 
agar in a selective plate containing an antibiotic. Colonies on the original 
plate giving resistant cells on the selective plate were dispersed into sin- 
gle cells, spread onto a nutrient agar plate without antibiotic, and 
allowed to multiply into colonies. This procedure was repeated until one 
or more colonies on the unselective media consisted exclusively of antibi- 
otic resistant cells. How does this experiment prove the point? 

3. Estimation of mutation rates from bacterial cultures can be tricky 
because, if a mutation occurs early in the life of a culture, the final fre-
quency will be very high; but if it occurs late, the final frequency will be
low. The fluctuation test is a method for getting around this problem by
growing many smaller cultures and estimating the mutation rate from
the proportion of cultures that contain no mutations using the zero term
of the Poisson distribution $P_0 = \exp(-\mu N)$, where $P_0$ is the proportion of
cultures with no mutations, $\mu$ is the mutation rate, and N is the average
number of cells per culture. In one experiment for bacteriophage T1 resis-
tance, 11/20 cultures contained no mutations and the average number of
cells per culture was $5.6 \times 10^8$. Estimate $\mu$.

4. If recessive lethals occur independently in Drosophila autosomes, and the 
probability that an autosome contains one or more recessive lethals is 
0.35 (a typical figure for chromosomes isolated from natural popula- 
tions), what is the average number of recessive lethals per chromosome? 
Assume that the distribution of lethals is Poisson so that the probability 
of a chromosome containing exactly i lethals is $P_i = (m^i/i!)\exp(-m)$, where
m is the mean.

5. The doubling dose of radiation is the quantity of radiation that induces as
many mutations as occur spontaneously, so the total mutation rate of 
organisms exposed to the doubling dose equals two times the sponta-
neous mutation rate. Below are the induction rates per rad of x-rays (a
standard measure of dose) for various genetic end points in irradiated
male mice, along with the spontaneous rates. What are the corresponding
doubling doses?

                           Induction rate/rad      Spontaneous rate

Dominant lethals           5 x l(r' /gamete        2 to 10 x 10^-2/gamete
Recessive visibles         7 x lO"*/locus          Hxld V locus
Reciprocal translocations  1 to 2 x ItrVcell       2 to 5 x 10" '/cell

6. For irreversible mutation with a forward mutation rate u = 5 x 10"", cal- 
culate the allele frequency p after 10, 100, 1000, and 10000 generations, 
assuming $p_0$ = 1.0.

7. If a transposable genetic element becomes fixed at a particular site but 
undergoes deletion at the rate of one percent per generation, how many 
generations are required to decrease the frequency of the element at the 
site to 90%? 

8. The following data give the frequency q of bacteria resistant to a bacterio-
phage after t generations of chemostat growth. At t = 12 hours a novel
metabolite was added to the medium. 

a. What is the basal rate of mutation to resistance? 

b. What is the effect of the novel metabolite on the mutation rate? 





1 X Iff* 


7 04x10'"* 


3 x l(r fi 




5 x lfl- fi 





9. In the forward and reverse mutation model, what is the equilibrium fre-
quency $\hat{p}$ of A if

a. u = 10^andv = 10" f '? 

b. u is increased tenfold? 

c. v is increased tenfold? 

d. both are increased tenfold? 

10. In the forward and reverse mutation model, show that the time required 
for the allele frequency to go halfway to equilibrium is approximately 
t = 0.7/(u + v) generations. Use the approximation that ln(1 - x) ≈ -x
when x is small. What time is required to go halfway to equilibrium
when ji = 1(r 5 and v = 10*? 

11. In the irreversible mutation model, what is the frequency $q_t$ of allele a in
generation t if the mutation rate changes from generation to generation? 
If the equation q, = q f u/ is applied to this situation, what value corre- 
sponds top? 


12. Suppose a gene has eight alleles at frequencies 0.55, 0.20, 0.09, 0.06, 0.04,
0.03, 0.02, and 0.01. What is the effective number of alleles? What would 
the effective number be if each allele had a frequency of 0.125? 

13. Why is the effective number of alleles essentially independent of the
number of rare alleles?

14. What is the equilibrium heterozygosity in a population of effective size 
50 if new neutral mutations are introduced at a rate $10^{-5}$ by mutation and
at a rate $10^{-3}$ by migration?

15. If the average number of alleles of a gene is 1 + x per diploid individual, 
where 0 < x < 1, then what is the heterozygosity? (Note that one diploid
individual is a random sample of two alleles.) 

16. Calculate the autozygosity F after 200 generations in a random mating 
population of effective size N = 50. 

17. In an isolated random mating population of effective size N, how many 
generations of random genetic drift are required to produce the same 
average inbreeding coefficient F as obtained in one generation of brother- 
sister mating (for which F = 1/4)? Use the approximation $[1 - 1/(2N)]^t \approx \exp[-t/(2N)]$.


18. If a mainland population of snails has an allele frequency of 0.8 and an 
island population has a frequency of 0.2, how many generations are 
required for the island population to achieve an allele frequency of 0.5, 
given a migration rate of 0.01?

19. If four populations with allele frequencies 0.2, 0.4, 0.6, and 0.8 undergo
migration according to the island model with m = 0.05, what are the 
expected allele frequencies after 10 generations? 

20. In the island model of migration, how does the variance in allele fre-
quency among populations change as a function of m and t?

21. When random genetic drift is offset by migration among populations in
the island model, what value of m is necessary to keep the equilibrium
value of F smaller than 0.05?


Darwinian Selection 

Natural Selection Fitness ■ Haploid Models Diploid Models 

Mutation-Selection Balance ■ Complex Modes of Selection 
Kin Selection . Interdeme Selection 


Thus far in this book, the term natural selection has been used in
the informal, intuitive sense used by Darwin in The Origin of 

Species (1859): 

Owing to this struggle for life, variations, however slight and from whatever
cause proceeding, if they be in any degree profitable to the individuals of a 
species, in their infinitely complex relations to other organic beings and to 
their physical conditions of life, will tend to the preservation of such individ- 
uals, and will generally be inherited by the offspring. The offspring, also, will 
thus have a better chance of surviving, for, of the many individuals of any
species which are periodically born, but a small number can survive. I have
called this principle, by which each slight variation, if useful, is preserved, by
the term Natural Selection.

Modern formulations of natural selection are less literary and usually 
compacted into a form resembling a logical syllogism: 

• In all species, more offspring are produced than can possibly survive and
reproduce.

• Organisms differ in their ability to survive and reproduce— in part owing 
to differences in genotype. 

• In every generation, genotypes that promote survival in the current envi- 
ronment are present in excess at the reproductive age and thus contribute 
disproportionately to the offspring of the next generation. 



Through natural selection, therefore, alleles that enhance survival and 
reproduction increase gradually in frequency from generation to generation,
and the population becomes progressively better able to survive and repro-
duce in the environment. The progressive genetic improvement in popula-
tions resulting from natural selection constitutes the process of evolutionary
adaptation.

In the brief description of natural selection quoted above, Darwin uses the 
term individual three times. The unit of selection is the individual organism:
not the species, not the subpopulation, not the sibship. It is the performance
of the individual organism that matters. Each individual organism competes 
in the struggle for existence and survives or perishes on its own. Darwin also 
used the terms "struggle for existence" and "survival of the fittest" as syn- 
onyms for natural selection, but he emphasized that he employed the terms 
in their widest metaphorical sense to include not only the life of the organism 
but also the success of the organism in leaving progeny: fecundity is as 
important as survival. In this chapter, we shall see how Darwin's concept of
"survival of the fittest" of individual organisms has been made more formal 
and quantitative and incorporated into models describing the change in 
allele frequency under natural selection. These models show that natural
selection acts simultaneously on different components of fitness and can 
operate at different levels of population structure. 


Selection acts on the phenotype, not on the genotype, and the total pheno- 
type is determined by many genes that interact with each other as well as 
with numerous environmental factors. However, in exploring the conse- 
quences of selection, it is convenient to focus on changes in the frequency of 
the alleles of a single gene. We shall begin by examining selection in its sim- 
plest form operating in a haploid, asexual organism, such as a species of bac- 
teria. In haploids, selection is realized as differential population growth; 
hence we shall make reference to the discrete and continuous models of pop- 
ulation growth examined in Chapter 1. The overall process of selection is 
identical whether population growth is in discrete or continuous generations, 
but the models have a somewhat different parameterization and it is neces- 
sary to relate the models to avoid confusion later. 

Discrete Generations 

Consider two bacterial genotypes, A and B, that reproduce asexually. For
simplicity, we will assume the discrete model of population growth dis-
cussed in Chapter 1 and we set a and b equal to the rates of population
growth of A and B, respectively. Equation 1.5 implies that $A_t = (1 + a)^t A_0$ and
$B_t = (1 + b)^t B_0$, where $A_t$ and $B_t$ are the number of cells of genotype A and
genotype B, respectively, at time t. Selection takes place when $a \ne b$. Figure
6.1A is an example in which the growth rates of A and B are a = 0.04 and b =
0.05, respectively. Both populations increase in size exponentially, but that of
B increases faster than that of A. In most cases, we are not interested in the
actual number of A cells or B cells but in the proportion of all cells that are of
type A. Equivalently, we can examine the ratio of the number of A cells to
that of B cells at time t, which is given by

$$\frac{A_t}{B_t} = \left(\frac{1 + a}{1 + b}\right)^t \frac{A_0}{B_0} = w^t\,\frac{A_0}{B_0} \tag{6.1}$$

The outcome of selection is determined by the ratio of a to b because if
a < b, then the ratio of A cells to B cells decreases until, ultimately, A is lost;
conversely, if a > b, then the ratio of A cells to B cells increases without limit.
Figure 6.1B shows the change in A/B for the example in part A. From a value
of 4 at the beginning, the ratio declines to a value of 1.54 in 100 generations;
these ratios correspond to frequencies of A of 0.80 and 0.61, respectively.

Figure 6.1 (A) Discrete population growth of two hypothetical bacterial
strains, A and B, in which the growth rates are 4% per generation for A and 5%
per generation for B. The population size is plotted every second generation.
The initial numbers are $1.6 \times 10^5$ for A and $0.4 \times 10^5$ for B. (B) Ratio of A to B.
Because the B population grows faster than the A population, the proportion
of A in the total population decreases.
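The numbers in this example are easy to check. The sketch below uses the growth rates and starting numbers quoted for the example (a = 0.04, b = 0.05, $1.6 \times 10^5$ cells of A and $0.4 \times 10^5$ cells of B); the function names are hypothetical.

```python
def cells(n0, rate, t):
    """Cell number after t discrete generations of growth at the given rate."""
    return n0 * (1 + rate) ** t

def ratio_and_frequency(t, a=0.04, b=0.05, a0=1.6e5, b0=0.4e5):
    """Return (A_t / B_t, frequency of A) after t generations."""
    a_t = cells(a0, a, t)
    b_t = cells(b0, b, t)
    return a_t / b_t, a_t / (a_t + b_t)

for t in (0, 50, 100):
    ratio, freq = ratio_and_frequency(t)
    print(f"t = {t:3d}  A/B = {ratio:.2f}  freq(A) = {freq:.2f}")
```

Both strains grow, yet the ratio A/B falls every generation by the constant factor (1 + a)/(1 + b), which is all that the outcome depends on.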

In the selection in Figure 6.1, it is not necessary to specify whether a and b
differ because of survivorship or fecundity. All that matters is that they do
differ. It is also important that the outcome depends only on the ratio
(1 + a)/(1 + b), which means that, in practice, we do not need to know the
absolute growth rates of A and B but only their relative values (their ratio). In
Equation 6.1, w represents the ratio (1 + a)/(1 + b). The symbol w is conven-
tionally used in discrete models of selection and, in this example, it is the rel-
ative fitness of genotype A to that of genotype B. In other words, in a haploid
organism, the relative fitness equals the ratio of the per-generation growth
factors, (1 + a)/(1 + b).

Although it is sometimes instructive to do so, it is not necessary to keep 
track of population size in models of selection. The variable of interest is usu-
ally the allele frequency and not the population size. Therefore, let $p_t$ and $q_t$
represent the frequencies of genotypes A and B, respectively, in generation t,
with $p_t + q_t = 1$. A method to relate the frequencies of A and B in any two suc-
cessive generations is illustrated in Table 6.1. For ease of discussion, we
divide each generation into three phases: birth, selection, and reproduction.
In generation t - 1, the frequencies of A and B at birth are $p_{t-1}$ and $q_{t-1}$, respec-
tively. The genotypes A and B are assumed to survive in the ratio w : 1, which
means that w is the probability of survival of an A genotype relative to that of
a B genotype. As before, the absolute probabilities of survival of the geno-
types are not relevant. All that matters is the ratio. After selection, the ratio of
frequencies of A : B equals $p_{t-1} \times w : q_{t-1} \times 1$. If the surviving genotypes repro-
duce with equal efficiency, then the frequencies at birth in the following gen-
eration are given by the expressions across the bottom in Table 6.1; the
denominators in these expressions are necessary to make the allele frequen-
cies in generation t sum to 1.

TABLE 6.1

                        A                                  B
Generation t - 1
  Before selection      p_{t-1}                            q_{t-1}
  Relative fitness      w                                  1
  After selection       p_{t-1} w                          q_{t-1}
Generation t            p_{t-1} w/(p_{t-1} w + q_{t-1})    q_{t-1}/(p_{t-1} w + q_{t-1})

Note: The fractions in the bottom line are expressions for the allele frequencies in generation t
in terms of those in generation t - 1. Although this model assumes survival, w : 1
could also be the relative probability of reproduction of A : B. More generally, the relative
fitness w : 1 represents the net output of A : B for the combined effects of survival
and reproduction.

For comparison with Equation 6.1, consider that $p_t$ is the number of A
cells in generation t divided by the total; likewise, $q_t$ is the number of B cells
divided by the total. Therefore, the ratio $p_t/q_t$ equals the ratio of A cells to B
cells in generation t because the denominators cancel. The expressions in
Table 6.1 imply that the ratio of p/q in any generation equals w multiplied by
the ratio of p/q in the previous generation, and so

$$\frac{p_t}{q_t} = w\,\frac{p_{t-1}}{q_{t-1}} = w^2\,\frac{p_{t-2}}{q_{t-2}} = \cdots = w^t\,\frac{p_0}{q_0} \qquad (6.2)$$

The right-hand side of Equation 6.2 is identical to that in Equation 6.1
except that the relative frequencies p and q replace the absolute number of
cells of type A and type B. Hence, to deduce the outcome of selection, we do
not need to keep track of population size. All we need to know is the relative
fitness w and the initial frequencies $p_0$ and $q_0$.
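
As a quick numerical check of this argument, the following Python sketch
iterates the generation-by-generation recursion of Table 6.1 and compares the
result with the closed form implied by Equation 6.2. The function names and
the example values (w = 0.99, initial frequency 0.8) are illustrative
assumptions, not values taken from the text.

```python
def next_frequency(p, w):
    """One generation of haploid selection (Table 6.1): A survives w : 1 relative to B."""
    q = 1.0 - p
    return p * w / (p * w + q)


def iterate(p0, w, t):
    """Apply the Table 6.1 recursion t times, starting from frequency p0."""
    p = p0
    for _ in range(t):
        p = next_frequency(p, w)
    return p


def frequency_after(p0, w, t):
    """Closed form from Equation 6.2: p_t/q_t = w**t * p_0/q_0, solved for p_t."""
    ratio = (w ** t) * p0 / (1.0 - p0)
    return ratio / (1.0 + ratio)


if __name__ == "__main__":
    p0, w, t = 0.8, 0.99, 100  # illustrative values only
    print(iterate(p0, w, t), frequency_after(p0, w, t))
```

The two printed numbers should agree to within floating-point error, which is
the sense in which population size never needs to be tracked.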

For application to experimental data, Equation 6.2 is often transformed by
taking the logarithm:

$$\log\frac{p_t}{q_t} = \log\frac{p_0}{q_0} + t \log w \qquad (6.3)$$

Equation 6.3 means, for example, that if the values of $p_t/q_t$ are monitored
in an experimental population of bacteria over the course of time, then a plot
of log ($p_t/q_t$) against time (in generations) should yield a straight line with
slope equal to log w. This kind of experiment is examined in the following
problem.

PROBLEM 6.1 In the intestinal bacterium E. coli, the gene gnd codes
for the enzyme 6-phosphogluconate dehydrogenase (6PGD), which is
used in the metabolism of gluconate but not in the metabolism of
ribose. The data below were obtained in experiments in which
otherwise genetically identical strains containing the alleles gnd(RM77C)
and gnd(RM43A) were grown in competition in chemostats in which
the sole source of carbon and energy was either gluconate or ribose
(Hartl and Dykhuizen 1981). These gnd alleles are polymorphic in
natural populations and code for allozymes of 6PGD. Gluconate is the
experimental condition to ascertain the effects on fitness of the gnd
alleles, and ribose is the control. In the table, $p_t$ denotes the frequency
of the strain containing gnd(RM43A) after t generations of
competition. From the two points under each growth condition, estimate the
fitness of the strain containing gnd(RM43A) relative to that containing
gnd(RM77C) under that growth condition:

Growth medium        $p_0$        $p_{35}$
Gluconate            0.455        0.898




ANSWER In gluconate medium, log (0.898/0.102) = log (0.455/0.545)
+ 35 × log w, and so log w = 0.0292, or w = 1.0696. Hence, the allele
gnd(RM43A) confers about a 7% selective advantage in competition for 
utilization of gluconate. In ribose medium, w = 0.999, a value that is 
not significantly different from 1.0, and so the alleles appear to be 
functionally equivalent in this environment. (There were more than 
two points in the original data, and the estimates of fitness were based 
on the slope of the linear regression; here we have quoted only two 
data points for computational convenience.) 
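
The two-point calculation in the answer can be reproduced with a few lines of
Python. This is a minimal sketch: the function name is an assumption made for
illustration, the logarithm is taken to base 10 as in the worked answer, and
only the gluconate frequencies quoted in the answer (0.455 and 0.898 over 35
generations) are used.

```python
import math


def relative_fitness(p0, pt, generations):
    """Estimate w from two frequencies with Equation 6.3:
    log(p_t/q_t) = log(p_0/q_0) + t * log(w)."""
    log_ratio_start = math.log10(p0 / (1.0 - p0))
    log_ratio_end = math.log10(pt / (1.0 - pt))
    log_w = (log_ratio_end - log_ratio_start) / generations
    return 10.0 ** log_w


# Gluconate frequencies quoted in the answer: p_0 = 0.455, p_35 = 0.898.
w_gluconate = relative_fitness(0.455, 0.898, 35)
print(round(w_gluconate, 4))
```

With more than two time points, the same quantity would be estimated from the
slope of a linear regression of log ($p_t/q_t$) on t, as the answer notes.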

Continuous Time

Populations such as those in Problem 6.1 do not reproduce in discrete
generations but instead reproduce continuously. In a continuous model, the
exponential population growth of A and B is governed by the equations
dA(t)/dt = a'A(t) and dB(t)/dt = b'B(t), where a' and b' are the growth
rates. Therefore, A(t) = A(0) exp(a't) and B(t) = B(0) exp(b't) (Chapter 1),
and so

$$\frac{A(t)}{B(t)} = \frac{A(0)}{B(0)}\,e^{(a'-b')t} = \frac{A(0)}{B(0)}\,e^{mt} \qquad (6.4)$$


Equation 6.4 means that, in a continuous population, the outcome of
selection depends on the difference between the exponential growth rates
a' - b', which is represented by the symbol m on the right-hand side. The value
of m also measures the fitness of strain A relative to strain B, but in a
continuously reproducing population. Comparing Equation 6.4 with
Equation 6.1 yields the relation between m and w:

$$m = \ln w \qquad (6.5)$$

In other words, the relative fitness with continuous growth, m, equals the
natural logarithm of the relative fitness with discrete reproduction, w.
Selective neutrality means that w = 1 or that m = 0. For the values of w estimated in
Problem 6.1, the corresponding values of m are 0.0673 and -0.001, respectively.
If w is not too different from 1, then m ≈ w - 1 is a reasonable approximation.
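
The conversion between the two fitness scales, and the quality of the
approximation m ≈ w - 1, can be checked directly. The sketch below uses
hypothetical function names and the two fitness estimates from Problem 6.1.

```python
import math


def malthusian_from_darwinian(w):
    """Equation 6.5: m = ln w."""
    return math.log(w)


def darwinian_from_malthusian(m):
    """Inverse of Equation 6.5: w = exp(m)."""
    return math.exp(m)


for w in (1.0696, 0.999):  # the two estimates from Problem 6.1
    m = malthusian_from_darwinian(w)
    # Print the exact value of m next to the approximation w - 1.
    print(w, round(m, 4), round(w - 1.0, 4))
```

For w = 1.0696 the approximation w - 1 overstates m only in the third decimal
place, and the discrepancy shrinks as w approaches 1.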

Change in Allele Frequency in Haploids 

Although the discrete and continuous models are completely equivalent
under the transformation in Equation 6.5, the equations for change in allele
frequency look rather different. In the discrete model, the change in the
frequency of strain A in generation t is given by the difference $p_t - p_{t-1}$, which
can be calculated in terms of $p_{t-1}$ from the formulas in Table 6.1. The
difference $p_t - p_{t-1}$ is usually symbolized $\Delta p$ and, for simplicity, the subscript
t - 1 is suppressed. Using the expressions in Table 6.1 and the fact that
q = 1 - p, we obtain

$$\Delta p = \frac{pw}{pw + q} - p = \frac{pq(w - 1)}{pw + q} \qquad (6.6)$$


Not surprisingly, p increases if the relative fitness of A is greater than 1
and decreases if the relative fitness of A is smaller than 1. If the relative
fitnesses of A and B are equal, then p does not change, provided that the
population size is very large (theoretically, it has to be infinite).

The analog of Equation 6.6 in a continuous model contains the derivative
dp/dt in place of $\Delta p$. This we can obtain from Equation 6.4 with a little
trickery. Because A(t)/B(t) equals p(t)/q(t), the derivative of Equation 6.4 with
respect to t must equal the derivative of p(t)/q(t) with respect to t. For
simplicity, we will write p and q instead of p(t) and q(t). The derivative of
Equation 6.4 with respect to t equals mp/q, and the derivative of p/q with respect to
t equals $(1/q^2) \times dp/dt$. Setting these expressions equal to each other and
solving for dp/dt, we obtain

$$\frac{dp}{dt} = mpq \qquad (6.7)$$

Voila! There is no denominator! What happened to it? In a technical sense,
it disappeared into the difference between the discrete model and the
continuous model. In a practical sense, the absence of a denominator in Equation
6.7 greatly simplifies some of the formulas to come, especially those
concerned with random genetic drift in Chapter 7. Although they look very
different, Equations 6.6 and 6.7 are merely different ways of saying the same
thing. In this chapter, we will deal mainly with expressions analogous to
Equation 6.6 because they are more easily derived for various types of
selection. However, when it is necessary to dispose of a troublesome denominator,
we will invoke the continuous model in Equation 6.7 and be rid of it.
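
The steps that lead from Equation 6.4 to Equation 6.7 can be written out
explicitly. The following is only a restatement of the argument in the text; it
uses q = 1 - p, so that dq/dt = -dp/dt, and p + q = 1.

```latex
\begin{align*}
\frac{p}{q} &= \frac{A(t)}{B(t)} = \frac{A(0)}{B(0)}\,e^{mt}
  && \text{(Equation 6.4)} \\
\frac{d}{dt}\!\left(\frac{p}{q}\right)
  &= m\,\frac{A(0)}{B(0)}\,e^{mt} = m\,\frac{p}{q}
  && \text{(differentiate the right-hand side)} \\
\frac{d}{dt}\!\left(\frac{p}{q}\right)
  &= \frac{q\,\dfrac{dp}{dt} - p\,\dfrac{dq}{dt}}{q^{2}}
   = \frac{q + p}{q^{2}}\,\frac{dp}{dt}
   = \frac{1}{q^{2}}\,\frac{dp}{dt}
  && \text{(quotient rule with } dq/dt = -dp/dt\text{)} \\
\frac{1}{q^{2}}\,\frac{dp}{dt} &= m\,\frac{p}{q}
  \quad\Longrightarrow\quad \frac{dp}{dt} = mpq
  && \text{(Equation 6.7)}
\end{align*}
```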

Darwinian Fitness and Malthusian Fitness

The distinction between the fitness parameters in the discrete and continuous
models has been incorporated into the terminology of population genetics in
the terms Darwinian fitness, which refers to the discrete model, and
Malthusian fitness, which refers to the continuous model. The latter is
named after Thomas Malthus (1766-1834), whose views on the implications
of continued population growth strongly influenced Darwin's thinking on
the subject. A Darwinian fitness is conventionally represented by the symbol
w, often embellished with a subscript, and Malthusian fitness is
conventionally represented by the symbol m. In this book, the term fitness, when used
without qualification, will mean Darwinian fitness unless it is clear from the
context that some other meaning is intended.


In diploid organisms, the consequences of selection are most conveniently
explored under the model of random mating in Chapter 3, but incorporating
selection by permitting the fitnesses of the genotypes to differ. Selection is
assumed to take place on the diploid genotypes. We shall use the
conventional symbols $w_{11}$, $w_{12}$, and $w_{22}$ to represent the Darwinian fitnesses of the
genotypes AA, Aa, and aa, respectively. The simplest way to interpret the
fitnesses is in terms of survivorship, usually termed viability, which is the
probability that a genotype survives from fertilization to reproductive age. If
the fitness of each genotype is set equal to its probability of survivorship,
then each fitness is an absolute fitness because its value is independent of
the fitnesses of the other genotypes. In practice, we usually know only the
value of the viability of each genotype relative to that of another genotype
chosen as the standard of comparison. When a fitness value is expressed
relative to that of another genotype, the fitness is a relative fitness. The relative
fitness of the genotype chosen as the standard of comparison is arbitrarily
assigned the value 1.

To consider a specific example, suppose that the genotypes AA, Aa, and aa
have probabilities of survival from conception to reproductive age of 0.75,
0.75, and 0.50, respectively. These are the absolute viabilities of the
genotypes. They can be judged realistic or not only if we specify the organism.
They may be plausible values if the organism is a mammal or a bird because
each offspring has a reasonable chance of survival, but implausible if the
organism is an insect or an oyster because, in these organisms, most
newborns are destined not to survive. Because selection depends on the
relative magnitudes of the viabilities, it is usually most convenient to express the
viabilities in relative terms. Taking genotype AA as the standard, the relative
viabilities of AA, Aa, and aa are 0.75/0.75, 0.75/0.75, and 0.50/0.75, or 1.0, 1.0,
and 0.67, respectively. Equivalently, we could choose genotype aa as the
standard, in which case the relative viabilities are 0.75/0.50, 0.75/0.50, and
0.50/0.50, or 1.5, 1.5, and 1.0, respectively. Usually, the relative viabilities are
calculated so that the largest relative viability equals 1.0. The relative
viabilities are equal to the relative fitnesses of the genotypes provided that the
genotypes are equally capable of reproduction. Viabilities expressed in
relative terms are as valid for osprey as for oysters because the relative fitnesses
are the same whether the absolute fitnesses are 0.75, 0.75, and 0.50 or 0.00075,
0.00075, and 0.00050.
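
The rescaling described in this paragraph is a one-line computation. The sketch
below is illustrative only: the function name and the dictionary representation
of genotypes are assumptions, and the two sets of absolute viabilities are the
ones used in the text.

```python
def relative_fitnesses(absolute, standard=None):
    """Scale absolute viabilities so that the standard genotype has fitness 1.

    `absolute` maps genotype names to survival probabilities. If `standard`
    is None, the genotype with the largest viability is used, so that the
    largest relative fitness equals 1.0.
    """
    if standard is None:
        standard = max(absolute, key=absolute.get)
    reference = absolute[standard]
    return {genotype: value / reference for genotype, value in absolute.items()}


bird = {"AA": 0.75, "Aa": 0.75, "aa": 0.50}
oyster = {"AA": 0.00075, "Aa": 0.00075, "aa": 0.00050}

print(relative_fitnesses(bird))                 # AA as the standard
print(relative_fitnesses(bird, standard="aa"))  # aa as the standard
print(relative_fitnesses(oyster))               # same relative values as the bird
```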

Change in Allele Frequency in Diploids 

If we write the allele frequencies of A and a as $p_t$ and $q_t$, respectively, in
generation t, then it is straightforward to derive expressions for the allele
frequencies in generation t in terms of the allele frequencies $p_{t-1}$ and $q_{t-1}$ in the
previous generation. The subscripts t and t - 1 are rather cumbersome to
carry along in equations, so we will use the symbols p and q for $p_{t-1}$ and $q_{t-1}$,
and the symbols p' and q' for $p_t$ and $q_t$.

TABLE 6.2

                                AA                     Aa                      aa                     Total
Generation t - 1
  Frequency before selection    $p^2$                  $2pq$                   $q^2$                  $1 = p^2 + 2pq + q^2$
  Relative fitness (viability)  $w_{11}$               $w_{12}$                $w_{22}$
  After selection               $p^2 w_{11}$           $2pq\,w_{12}$           $q^2 w_{22}$           $\bar{w} = p^2 w_{11} + 2pq\,w_{12} + q^2 w_{22}$
  Normalized                    $p^2 w_{11}/\bar{w}$   $2pq\,w_{12}/\bar{w}$   $q^2 w_{22}/\bar{w}$
Generation t
  $p' = \dfrac{p^2 w_{11} + pq\,w_{12}}{\bar{w}}$      $q' = \dfrac{pq\,w_{12} + q^2 w_{22}}{\bar{w}}$

Note: The allele frequencies p and q are those in gametes immediately prior to
fertilization. The AA, Aa, and aa zygotes survive to reproductive maturity in the ratio
$w_{11} : w_{12} : w_{22}$. All genotypes, as adults, are assumed to have the same
reproductive capacity.

The relation between the allele frequencies in two consecutive generations
is deduced in Table 6.2, where the fitnesses $w_{11}$, $w_{12}$, and $w_{22}$ are the relative
viabilities. In generation t - 1, the genotype frequencies of AA, Aa, and aa
among newly fertilized eggs are given by $p^2$, $2pq$, and $q^2$, respectively,
assuming random mating. By definition, newly fertilized eggs survive in the ratio
$w_{11} : w_{12} : w_{22}$, and so the ratio of AA : Aa : aa among surviving adults is

$$p^2 w_{11} : 2pq\,w_{12} : q^2 w_{22}$$

To proceed, we need to convert the terms in the above expression into
relative frequencies by dividing each term by the sum. The value of the sum is
indicated in Table 6.2 as

$$\bar{w} = p^2 w_{11} + 2pq\,w_{12} + q^2 w_{22}$$

The symbol $\bar{w}$ is the average fitness in the population in generation t - 1.
Division of each term in the ratio of survivors by $\bar{w}$ yields the genotype
frequencies among adults:

$$AA:\ \frac{p^2 w_{11}}{\bar{w}} \qquad Aa:\ \frac{2pq\,w_{12}}{\bar{w}} \qquad aa:\ \frac{q^2 w_{22}}{\bar{w}}$$


Among the surviving adults, the AA genotypes produce all A gametes,
the Aa genotypes produce 1/2 A and 1/2 a gametes, and the aa genotypes
produce all a gametes. Hence, the frequencies of the gametes that unite at
random to form the zygotes of the next generation are:

$$p' = \frac{p^2 w_{11} + pq\,w_{12}}{\bar{w}} \qquad q' = \frac{pq\,w_{12} + q^2 w_{22}}{\bar{w}} \qquad (6.10)$$


These are the relations we were after because they express the allele fre- 
quencies in any generation in terms of the allele frequencies in the previous 
generation. From these equations, the outcome of selection can be deduced. 

As in the haploid model, it is often useful to know $\Delta p$, which is the
difference in allele frequency p' - p resulting from one generation of selection.
Subtraction of p from the expression for p' in Equation 6.10 and a little
manipulation leads to:

$$\Delta p = \frac{pq\,[\,p(w_{11} - w_{12}) + q(w_{12} - w_{22})\,]}{\bar{w}} \qquad (6.11)$$


Equation 6.11 is the diploid analog of that in the haploid model in Equa- 
tion 6.6. 
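
The "little manipulation" connecting Equations 6.10 and 6.11 can be checked
numerically. In the sketch below, the function names and the fitness values are
hypothetical; the code only confirms that p' - p computed from Equation 6.10
agrees with the closed form in Equation 6.11.

```python
def mean_fitness(p, w11, w12, w22):
    """Average fitness in the population (the sum shown in Table 6.2)."""
    q = 1.0 - p
    return p * p * w11 + 2.0 * p * q * w12 + q * q * w22


def next_p(p, w11, w12, w22):
    """Frequency of A after one generation of selection (Equation 6.10)."""
    q = 1.0 - p
    return (p * p * w11 + p * q * w12) / mean_fitness(p, w11, w12, w22)


def delta_p(p, w11, w12, w22):
    """Change in the frequency of A in one generation (Equation 6.11)."""
    q = 1.0 - p
    numerator = p * q * (p * (w11 - w12) + q * (w12 - w22))
    return numerator / mean_fitness(p, w11, w12, w22)


if __name__ == "__main__":
    # Hypothetical fitnesses, chosen only to exercise the formulas.
    w11, w12, w22 = 1.0, 0.9, 0.6
    for p in (0.1, 0.5, 0.9):
        print(p, next_p(p, w11, w12, w22) - p, delta_p(p, w11, w12, w22))
```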

At this point, an example of the use of these equations is in order. We will
use data on the change in the frequency of the Cy (Curly wings) allele in a
laboratory population of Drosophila melanogaster, which are plotted in Figure
6.2. The Cy allele is lethal when homozygous, so $w_{11} = 0$. The points in Figure
6.2 pertain to the frequency of Cy heterozygotes but, because Cy/Cy
genotypes do not survive, the allele frequency p of Cy equals one-half the
frequency of Cy/+ adults. The points in the figure are each separated by one
generation, and the initial generation has a frequency of Cy/+ adults of 0.67,
hence p = 0.335 and thus q = 0.665. Wright (1977) has studied these data and
concluded that $w_{12} = 0.5$ for Cy/+ genotypes, relative to a value of $w_{22} = 1$
for +/+ genotypes. Substituting these values for p, q, $w_{11}$, $w_{12}$, and $w_{22}$ into the
expression for p' in Equation 6.10 yields

$$p' = \frac{0.335^2 \times 0 + 0.335 \times 0.665 \times 0.5}{0.335^2 \times 0 + 2 \times 0.335 \times 0.665 \times 0.5 + 0.665^2 \times 1} = 0.168$$

Therefore, the predicted frequency of Cy/+ adults in generation 1 is
2p' = 0.336, which is reasonably close to the observed value of 0.368.

Figure 6.2 Change in frequency of adult Drosophila melanogaster heterozygous
for the dominant mutation Cy (Curly wings) in an experimental population. The
genotype Cy/Cy is lethal. The curve represents the theoretical change in
frequency when the ratio of viabilities of Cy/+ to +/+ is 0.5 : 1. The horizontal
axis is time (t, in generations). (Data from Teissier 1942. The fitness value of
0.5 was estimated by Wright 1977.)


PROBLEM 6.2 Assume a value of p = 0.168 for the frequency of the Cy
allele in generation 1 in the population in Figure 6.2. Calculate the
expected frequency of Cy/+ heterozygotes among adults in generation 2.

ANSWER In this case, $p' = [0.168^2 \times 0 + 0.168 \times 0.832 \times 0.5]/\bar{w}$, where
$\bar{w} = 0.168^2 \times 0 + 2 \times 0.168 \times 0.832 \times 0.5 + 0.832^2 \times 1.0 = 0.832$; hence p' =
0.0699/0.832 = 0.084. The expected frequency of Cy/+ adults is 2p' =
0.168. This result is very close to the observed value of 0.165. The
theoretical curve in Figure 6.2 was calculated using the same
generation-by-generation algorithm.
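
The generation-by-generation algorithm mentioned in the answer can be sketched
in Python as follows. The function names are assumptions made for illustration;
the fitnesses (0 for Cy/Cy, 0.5 for Cy/+, 1 for +/+) and the starting
heterozygote frequency of 0.67 are those given in the text. The computed values
differ slightly from the worked figures because the text rounds p' to three
decimal places before doubling it.

```python
def next_p(p, w11, w12, w22):
    """Equation 6.10: frequency of the first allele after one generation."""
    q = 1.0 - p
    w_bar = p * p * w11 + 2.0 * p * q * w12 + q * q * w22
    return (p * p * w11 + p * q * w12) / w_bar


def cy_trajectory(initial_heterozygote_freq, generations, w12=0.5):
    """Expected frequency of Cy/+ adults in successive generations.

    Cy/Cy is lethal (w11 = 0) and +/+ has fitness 1. Because Cy/Cy adults
    do not occur, the frequency of Cy/+ adults is twice the allele frequency.
    """
    p = initial_heterozygote_freq / 2.0
    freqs = [initial_heterozygote_freq]
    for _ in range(generations):
        p = next_p(p, 0.0, w12, 1.0)
        freqs.append(2.0 * p)
    return freqs


print([round(f, 3) for f in cy_trajectory(0.67, 5)])
```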

We make a slight digression to point out that it is sometimes convenient
to think in terms of the marginal fitnesses of the A and a alleles. The
marginal fitness equals the average fitness of all genotypes containing A or a,
respectively, weighted by their relative frequency and the number of A or a
alleles they contain. For example, A alleles are found in AA and Aa genotypes
in the proportions p and q and, therefore, the marginal fitness $w_1$ of
A-containing genotypes equals $p w_{11} + q w_{12}$. Similarly, the marginal fitness of
a-containing genotypes is $w_2 = p w_{12} + q w_{22}$. The expression for p' in Equation 6.10
thus becomes $p' = p w_1/\bar{w}$, and Equation 6.11 becomes $\Delta p = p(w_1 - \bar{w})/\bar{w}$. This
expression makes it clear that any allele increases in frequency if the
marginal fitness of genotypes containing the allele ($w_1$) is greater than the average
fitness in the population ($\bar{w}$). This approach also generalizes readily to
multiple alleles: for an allele with frequency $p_i$ and marginal fitness $w_i$, the change
in frequency in one generation equals

$$\Delta p_i = \frac{p_i\,(w_i - \bar{w})}{\bar{w}}$$


Time Required for a Given Change in Allele Frequency 

Having derived Equation 6.11 for $\Delta p$ resulting from one generation of
selection, it is an appropriate next step to express $p_t$ in terms of $p_0$, as we did in
Chapter 5 for the analogous equations involving mutation and migration. For
any specified values of the initial allele frequencies and the fitness
parameters, the allele frequencies can be determined generation after generation by
computer iteration, as in Problem 6.2. More generally, one might want an
explicit mathematical formula for $p_t$ in terms of $p_0$, but Equation 6.11 does not
lend itself to analytical solution.


There is an alternative approach based on a continuous model, however.
If the fitnesses are expressed as Malthusian fitnesses rather than as
Darwinian fitnesses, then the analog of Equation 6.11 for a continuously growing
population is

$$\frac{dp}{dt} = pq\,[\,p(m_{11} - m_{12}) + q(m_{12} - m_{22})\,] \qquad (6.13)$$

where the values of m are the Malthusian fitnesses. Note that there is no
denominator in Equation 6.13 because it disappeared in the same way as the
denominator in Equation 6.7. A less elegant way to derive an equation like
Equation 6.13 is to suppose that the Darwinian fitnesses are all quite close to
1; then the change in allele frequency is slow enough that $\Delta p \approx dp/dt$ and,
furthermore, $\bar{w} \approx 1$. Under these conditions, Equation 6.11 takes the form of
Equation 6.13 with the m values replaced with w values.

To solve Equation 6.13, the terms are rearranged to isolate those in p on
one side and those in t on the other; then one side is integrated over p from $p_0$
to $p_t$, and the other is integrated over t from 0 to t. The details are left as an
exercise. The answers are most easily presented if we change the symbols. For
this purpose, we rewrite the fitnesses of the genotypes as follows:

$$w_{11} = 1 \qquad w_{12} = 1 - hs \qquad w_{22} = 1 - s$$
$$m_{11} = 0 \qquad m_{12} = -hs \qquad m_{22} = -s$$

where the Malthusian fitnesses follow from the approximation $m_{ij} \approx w_{ij} - 1$
when $w_{ij} \approx 1$. Use of the h and s symbols for the fitnesses has the advantage of
making the amount of selection and the degree of dominance explicit.

If s is positive and h is not negative, selection favors genotypes carrying
the A allele. In this context, s is called the selection coefficient against the aa
genotype, and h is called the degree of dominance of the a allele. For
example, when h = 0, the Darwinian fitnesses of AA, Aa, and aa are 1, 1, and 1 - s,
respectively, and a is completely recessive to A. Alternatively, when h = 1, the
Darwinian fitnesses are 1, 1 - s, and 1 - s, respectively, and a is completely
dominant to A. In terms of the selection coefficient and the degree of
dominance, dp/dt of Equation 6.13 becomes

$$\frac{dp}{dt} = pqs\,[\,ph + q(1 - h)\,]$$


The following equations give $p_t$ in terms of $p_0$ in three cases of importance.

• A is a favored dominant. In this case h = 0. Then $dp/dt = pq^2 s$, and

  $$\ln\frac{p_t}{q_t} + \frac{1}{q_t} = \ln\frac{p_0}{q_0} + \frac{1}{q_0} + st \qquad (6.15)$$

• A is favored and the alleles are additive in their effects on fitness.
  Additive effects on fitness means that the fitness of the heterozygote is exactly
  intermediate between the fitnesses of the homozygotes, and so h = 1/2. The
  additive case is also referred to as semidominance or as genic selection.
  When h = 1/2, then dp/dt = pqs/2, and

  $$\ln\frac{p_t}{q_t} = \ln\frac{p_0}{q_0} + \frac{s}{2}\,t \qquad (6.16)$$

  Note that Equation 6.16 for additive alleles is similar in form to Equation
  6.3 for haploid selection when w = 1 + s/2 and s is small. In other words,
  slow selection of additive alleles in a diploid species is mathematically
  almost equivalent to selection in a haploid species. In Problem 6.3, you
  will see that the precise requirement is $w_{12} = \sqrt{w_{11} w_{22}}$.

• A is a favored recessive. In this case, h = 1, so $dp/dt = p^2 qs$, and

  $$\ln\frac{p_t}{q_t} - \frac{1}{p_t} = \ln\frac{p_0}{q_0} - \frac{1}{p_0} + st \qquad (6.17)$$

Some of the practical implications of these equations are explored below. 
Problem 6.3 explores a little more deeply the relation between selection in 
haploid species and selection in diploid species. Figure 6.3 illustrates the 
changes in allele frequency for Equations 6.15 through 6.17. 
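
The three cases can be turned into small functions that return the number of
generations needed to move from one allele frequency to another. This is a
sketch under stated assumptions: the closed forms used here were obtained by
integrating the three expressions for dp/dt given above ($pq^2s$, $pqs/2$, and
$p^2qs$) and are intended to correspond to Equations 6.15 through 6.17; the
function names are hypothetical.

```python
import math


def _log_ratio(p):
    """ln(p/q) with q = 1 - p."""
    return math.log(p / (1.0 - p))


def time_dominant(p0, pt, s):
    """Favored dominant (h = 0): ln(p/q) + 1/q increases by s per generation."""
    start = _log_ratio(p0) + 1.0 / (1.0 - p0)
    end = _log_ratio(pt) + 1.0 / (1.0 - pt)
    return (end - start) / s


def time_additive(p0, pt, s):
    """Additive alleles (h = 1/2): ln(p/q) increases by s/2 per generation."""
    return 2.0 * (_log_ratio(pt) - _log_ratio(p0)) / s


def time_recessive(p0, pt, s):
    """Favored recessive (h = 1): ln(p/q) - 1/p increases by s per generation."""
    start = _log_ratio(p0) - 1.0 / p0
    end = _log_ratio(pt) - 1.0 / pt
    return (end - start) / s


if __name__ == "__main__":
    s = 0.05  # five percent difference between homozygotes, as in Figure 6.3
    for name, fn in (("dominant", time_dominant),
                     ("additive", time_additive),
                     ("recessive", time_recessive)):
        # Time to rise from rare to common, then from common to nearly fixed.
        print(name, round(fn(0.01, 0.5, s)), round(fn(0.5, 0.99, s)))
```

The output illustrates the pattern described for Figure 6.3: a favored
recessive is slow to increase while rare, and a favored dominant is slow to
complete fixation once it is common.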

PROBLEM 6.3 The discrete model of selection in a haploid species
is completely equivalent to that in a diploid species if, in the diploid,
the Darwinian fitness of the heterozygote equals the geometric mean
of the Darwinian fitnesses of the homozygotes, that is, if
$w_{12} = \sqrt{w_{11} w_{22}}$. Show that, in this case, Equation 6.6 for $\Delta p$ in a haploid
species is, indeed, identical to Equation 6.11 for $\Delta p$ in a diploid
species. What is the equivalent value of w in the haploid in terms of
the Darwinian fitnesses in the diploid?

ANSWER Substitute $w_{12} = \sqrt{w_{11}}\sqrt{w_{22}}$ into Equation 6.11. The
numerator simplifies to $pq \times (\sqrt{w_{11}} - \sqrt{w_{22}}) \times (p\sqrt{w_{11}} + q\sqrt{w_{22}})$. The
denominator simplifies to $(p\sqrt{w_{11}} + q\sqrt{w_{22}})^2$. Therefore,

$$\Delta p = \frac{pq\,(\sqrt{w_{11}} - \sqrt{w_{22}})}{p\sqrt{w_{11}} + q\sqrt{w_{22}}}$$

This is in exactly the same form as Equation 6.6 with $w = \sqrt{w_{11}/w_{22}}$.
Taking $w_{22} = 1$ as the standard, w in the haploid model equals the
fitness of the heterozygote in the diploid model. More specifically, let
$w_{11} = (1 + s/2)^2$, $w_{12} = 1 + s/2$, and $w_{22} = 1$. If s is small compared to 1,
then $w_{11} = (1 + s/2)^2 \approx 1 + s$, which implies that the Darwinian
fitnesses are approximately additive. Furthermore, $\Delta p \approx pqs/2$, which has
the same form as dp/dt in the additive case leading to Equation 6.16.

Figure 6.3 The change in frequency p of a favorable allele that is either
dominant, additive, or recessive in its effect on fitness. The frequency of a favored
dominant allele changes most slowly when the allele is common, and the
frequency of a favored recessive allele changes most slowly when the allele is rare.
In all three examples, the difference in relative fitness between the homozygous
AA and aa genotypes is assumed to be five percent. The horizontal axis is the
number of generations.

PROBLEM 6.4 A certain highly isolated colony of the moth Panaxia
dominula near Oxford, England, was intensively studied by Ford
and collaborators over the period 1928 to 1968 (Ford and Sheppard
1969). This colony contained a mutant allele affecting color pattern.
The frequency of the mutant allele declined steadily over the period
1939 to 1968. Indeed, the accompanying steady increase in the
frequency of the normal allele followed Equation 6.16 for additive
genes with s = 0.20 (Wright, 1978, shows a graph). The species has
one generation per year, and the estimated frequency of the mutant
allele in 1965 was 0.008. (This value is actually the average for the
seven-year period 1962 to 1968.) Estimate the frequency of the
mutant allele in 1950 and in 1940.

ANSWER Here we are given $q_t$ and want to use Equation 6.16 to
estimate $q_0$. Between 1950 and 1965, there were t = 1965 - 1950 = 15
generations. We are given $q_t$ = 0.008, hence $p_t$ = 0.992 and ln (0.992/
0.008) = 4.820. Thus, 4.820 = ln ($p_0/q_0$) + (0.20/2) × 15, or ln ($p_0/q_0$) =
3.32. Then $p_0/q_0$ = 27.660, or $p_0$ = 0.965 and $q_0$ = 0.035. For the year
1940, t = 1965 - 1940 = 25 generations, from which $p_0$ = 0.911 and $q_0$ =
0.089. (You may be interested to know that observations made at the
time yielded estimates of $q_0$ = 0.037 in 1950 and $q_0$ = 0.111 in 1940.)
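
The back-calculation in this answer is easy to script. The following sketch
assumes the additive-case relation quoted in the answer,
ln ($p_t/q_t$) = ln ($p_0/q_0$) + (s/2)t, and uses a hypothetical function name.

```python
import math


def earlier_frequency(q_t, s, generations):
    """Frequency of the disfavored allele `generations` earlier.

    Solves ln(p_t/q_t) = ln(p_0/q_0) + (s/2) * t for q_0.
    """
    p_t = 1.0 - q_t
    log_ratio_0 = math.log(p_t / q_t) - (s / 2.0) * generations
    ratio_0 = math.exp(log_ratio_0)   # this is p_0 / q_0
    return 1.0 / (1.0 + ratio_0)      # q_0, because p_0 + q_0 = 1


for year in (1950, 1940):
    print(year, round(earlier_frequency(0.008, 0.20, 1965 - year), 3))
```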

Application to the Evolution of Insecticide Resistance 

Some of the most dramatic examples of evolution in action result from the
natural selection for chemical pesticide resistance in natural populations of
insects and other agricultural pests. In the 1940s, when chemical pesticides
were first used on a large scale, an estimated 7% of the agricultural crops in
the United States were lost to insects. Initial successes in chemical pest
management were followed by gradual loss of effectiveness. Today, more than 400
pest species have evolved significant resistance to one or more pesticides, and
13% of the agricultural crops in the United States are lost to insects (May
1985). In many cases, significant pesticide resistance has evolved in 5 to 50
generations irrespective of the insect species, geographical region, pesticide,
frequency and method of use, and other seemingly important variables (May
1985). Equations 6.15 through 6.17 help to explain this apparent paradox
because many of the resistance phenotypes result from single mutant alleles.
The resistance alleles are often partially or completely dominant, so
Equations 6.15 and 6.16 are applicable. Prior to use of the pesticide, the allele
frequency $p_0$ of the resistant mutant is generally close to 0. Use of the pesticide
increases the allele frequency, sometimes by many orders of magnitude, but
significant resistance is noticed in the pest population even before the allele
frequency $p_t$ increases above a few percent. Thus, as rough approximations,
we may assume that $q_0$ and $q_t$ are both close enough to 1 that ln ($p_0/q_0$) ≈ ln $p_0$
and ln ($p_t/q_t$) ≈ ln $p_t$. Using these approximations, Equation 6.16 (additive case)
implies that t ≈ (2/s) × ln ($p_t/p_0$) and Equation 6.15 (dominant case) implies
that t ≈ (1/s) × ln ($p_t/p_0$). In many instances, the ratio $p_t/p_0$ may range from
$1 \times 10^2$ to perhaps $1 \times 10^7$, and s may typically be 0.5 or greater. Over this wide
range of parameter values, the time t is effectively limited to a range of 5 to 50
generations for the appearance of a significant degree of pesticide resistance.
Details in actual examples depend on such factors as effective population
number and extent of genetic isolation between local populations. An
example of the global spread of an insecticide-resistance allele is given in Chapter
8. The evolution of resistance caused by multiple interacting alleles may be
expected to take somewhat longer than single-gene resistance.

PROBLEM 6.5 In the discussion of the evolution of insecticide
resistance, we used the approximation t ≈ (1/s) × ln ($p_t/p_0$) for the
dominant case and t ≈ (2/s) × ln ($p_t/p_0$) for the semidominant case.
Evaluate the adequacy of the approximations for the values in the
accompanying table by comparing them with the more exact values
calculated from Equations 6.15 and 6.16.

Example no.

$1 \times 10^{-4}$

$1 \times 10^{-4}$

$1 \times 10^{-4}$

$1 \times 10^{-7}$

$1 \times 10^{-4}$



ANSWER The approximations are quite acceptable for the
examples. The more exact and approximate values are as follows:

Example no.    Eqn. 6.15    Approximation    Eqn. 6.16    Approximation

18.5

18.4

55.7

69.1
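
The kind of comparison asked for in Problem 6.5 can be sketched in Python. The
parameter values below are hypothetical and are not taken from the table in the
problem; the exact expressions are the integrated forms of $dp/dt = pq^2s$
(dominant case) and $dp/dt = pqs/2$ (additive case), which are assumed here to
correspond to Equations 6.15 and 6.16. All function names are illustrative.

```python
import math


def exact_dominant(p0, pt, s):
    """Dominant case solved for t: ln(p/q) + 1/q increases by s per generation."""
    def f(p):
        q = 1.0 - p
        return math.log(p / q) + 1.0 / q
    return (f(pt) - f(p0)) / s


def exact_additive(p0, pt, s):
    """Additive case solved for t: ln(p/q) increases by s/2 per generation."""
    def f(p):
        return math.log(p / (1.0 - p))
    return 2.0 * (f(pt) - f(p0)) / s


def approx_dominant(p0, pt, s):
    """Approximation t = (1/s) ln(p_t/p_0), valid while q stays close to 1."""
    return math.log(pt / p0) / s


def approx_additive(p0, pt, s):
    """Approximation t = (2/s) ln(p_t/p_0), valid while q stays close to 1."""
    return 2.0 * math.log(pt / p0) / s


# Hypothetical example: a hundredfold increase from a rare starting frequency.
p0, pt, s = 1e-4, 1e-2, 0.25
print(exact_dominant(p0, pt, s), approx_dominant(p0, pt, s))
print(exact_additive(p0, pt, s), approx_additive(p0, pt, s))
```

For these assumed values the exact and approximate times differ by less than
one percent, because the resistance allele is still rare at the end of the
interval.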


An equilibrium value of p in a discrete model is any value for which $\Delta p = 0$.
When the allele frequency is at an equilibrium in an infinite population, the
allele frequency remains the same generation after generation. Because real
populations are finite in size, an allele frequency is subject to chance
fluctuations and so cannot usually remain exactly at an equilibrium value. For any
equilibrium, therefore, it is important to consider how the allele frequency
behaves when it is close, but not exactly equal, to the equilibrium value. Any
equilibrium can be classified as one of several different types according to the
behavior of the allele frequency when it is near the equilibrium:

• An equilibrium is said to be locally stable if the allele frequency, when it
is already close to the equilibrium, moves progressively closer in
subsequent generations. A locally stable equilibrium may also be globally
stable. This term means that the allele frequency always moves toward the
equilibrium regardless of where it starts, even if initially far away from
the equilibrium. A polymorphism with a stable equilibrium is sometimes
called a balanced polymorphism.

• An equilibrium is unstable if the allele frequency, initially close to the
equilibrium, moves progressively farther away in subsequent generations.

• An equilibrium is called neutrally stable or semistable if the allele
frequency has no tendency to change regardless of its initial value. In such a
case, every allele frequency represents an equilibrium because $\Delta p = 0$
whatever the value of p. This type of equilibrium is exemplified by the
Hardy-Weinberg principle in an infinite population (Chapter 3).

The concepts of stability can be applied to the case of selection governed
by Equation 6.11 in which A is the favored allele. For A to be favored, we
need $w_{11} \ge w_{12} \ge w_{22}$, and at least one of the inequalities must be strict. In
such a case, there are only two equilibria, namely p = 0 and p = 1. Except for
p = 0 and p = 1, where $\Delta p = 0$, it is always true that $\Delta p > 0$. Hence, if p is close
to 0, its value increases (moving it farther away from 0), and so the
equilibrium at p = 0 is unstable. On the other hand, if p is near 1, it moves still closer to
1 (because $\Delta p > 0$), and so the equilibrium at p = 1 is locally stable. In this
example, p eventually goes to 1 whatever its initial value, and so the
equilibrium at p = 1 is globally stable also.
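
The stability argument can be illustrated numerically by looking at the sign of
$\Delta p$ just beside an equilibrium. The sketch below is a simplification made
for illustration: the function names, the step size, and the fitness values are
assumptions, and an equilibrium approached from one side but left from the other
is simply reported as unstable.

```python
def delta_p(p, w11, w12, w22):
    """Equation 6.11: one-generation change in the frequency of A."""
    q = 1.0 - p
    w_bar = p * p * w11 + 2.0 * p * q * w12 + q * q * w22
    return p * q * (p * (w11 - w12) + q * (w12 - w22)) / w_bar


def classify(p_eq, fitnesses, eps=1e-3):
    """Classify an equilibrium by the direction of change just beside it.

    Only neighbours inside the interval [0, 1] are examined, so the boundary
    equilibria p = 0 and p = 1 are judged from one side.
    """
    toward, away = 0, 0
    for p in (p_eq - eps, p_eq + eps):
        if not 0.0 <= p <= 1.0:
            continue
        change = delta_p(p, *fitnesses)
        if change == 0.0:
            continue
        if (change > 0) == (p < p_eq):
            toward += 1   # the frequency moves back toward the equilibrium
        else:
            away += 1     # the frequency moves farther away
    if toward and not away:
        return "locally stable"
    if away:
        return "unstable"
    return "neutrally stable"


directional = (1.0, 0.9, 0.8)  # hypothetical fitnesses with w11 >= w12 >= w22
print(classify(0.0, directional), classify(1.0, directional))
```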


With two alleles of a gene in a diploid organism, there is the possibility that
the heterozygous genotype has the highest fitness or that the heterozygous
genotype has the lowest fitness. These cases illustrate equilibria in which the
equilibrium value of p is between 0 and 1.

Overdominance, also called heterozygote superiority, is the term applied 
when the heterozygote has a higher fitness than both homozygotes. Symbol- 
ically, heterozygote superiority means that w n > w u and simultaneously 
w u > o>22- With overdominance, p = and p = 1 are both equilibria because, 
according to Equation 6.11, Ap - at these values. There is also a third equi- 
librium made possible by the fact that p{w n - u> 13 ) + q[iv l2 - 11*22) can equal 0. 
The equilibrium frequency of A is conventionally denoted p; hence the equi- 


Darwinian Selection 229 

libnum allele frequency of « is 7 = 1 - p The equilibrium can be found by 

solving p(w n - w l2 ) + q(w n ~ w n ) = 0, from which a little algebra gives 

P = 

«'12 - H>22 

2W|2-H>,, -W 2 2 


Equation 6.18 is often encountered in another form in which the fitnesses
are all expressed relative to that of the heterozygote by setting $w_{11} = 1 - s$,
$w_{12} = 1$, and $w_{22} = 1 - t$. (This formulation is proposed at the risk of some
confusion because t is now the selection coefficient against aa rather than the
time in generations.) With these substitutions, Equation 6.18 becomes

$$\hat{p} = \frac{t}{s + t}$$

This relationship makes a lot of intuitive sense because it implies that greater
selection against aa increases the equilibrium frequency $\hat{p}$ of A.
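As a rough numerical check of Equation 6.18, the following Python sketch iterates the one-generation change of Equation 6.11 and compares the result with the closed-form equilibrium. The function names, the starting frequencies, and the number of generations are illustrative assumptions; the fitnesses are those quoted for Figure 6.4.

```python
def mean_fitness(p, w11, w12, w22):
    """Average fitness in the population (Equation 6.8)."""
    q = 1.0 - p
    return p * p * w11 + 2.0 * p * q * w12 + q * q * w22


def delta_p(p, w11, w12, w22):
    """One-generation change in the frequency of A (Equation 6.11)."""
    q = 1.0 - p
    numerator = p * q * (p * (w11 - w12) + q * (w12 - w22))
    return numerator / mean_fitness(p, w11, w12, w22)


def interior_equilibrium(w11, w12, w22):
    """Interior equilibrium frequency of A (Equation 6.18)."""
    return (w12 - w22) / (2.0 * w12 - w11 - w22)


def iterate(p0, generations, w11, w12, w22):
    """Apply Equation 6.11 repeatedly, starting from frequency p0."""
    p = p0
    for _ in range(generations):
        p += delta_p(p, w11, w12, w22)
    return p


# Fitnesses quoted for Figure 6.4 (overdominance).
W11, W12, W22 = 0.9, 1.0, 0.8
p_hat = interior_equilibrium(W11, W12, W22)

# The same equilibrium in the s, t form: w11 = 1 - s, w22 = 1 - t.
s, t = 1.0 - W11, 1.0 - W22
p_hat_st = t / (s + t)

# Two arbitrary starting points, one on each side of the equilibrium.
p_from_low = iterate(0.05, 500, W11, W12, W22)
p_from_high = iterate(0.95, 500, W11, W12, W22)
print(p_hat, p_hat_st, p_from_low, p_from_high)
```

Both trajectories should end close to the value given by Equation 6.18, which is the numerical counterpart of the global stability described in the text.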

The overdominance equilibrium in Equation 6.18 is globally stable, whereas
those at p = 0 and p = 1 are unstable. The time course is indicated in Figure
6.4A, where the arrowheads show the direction of change in allele frequency.
Figure 6.4B shows the change in $\bar{w}$ with overdominance. The average fitness

Figure 6.4 Selection when there is overdominance. (A) The allele frequencies
converge to an equilibrium value irrespective of the initial frequency; the
horizontal axis is time (t, in generations). In this
example, $w_{11} = 0.9$, $w_{12} = 1$, and $w_{22} = 0.8$, and the equilibrium frequency of the A
allele, $\hat{p}$, is 0.667. (B) Average fitness $\bar{w}$ against p for the same example. Note that
$\bar{w}$ is a maximum at equilibrium.

in the population is maximized at the stable equilibrium. Maximization of
average fitness is a frequent outcome of selection in random-mating populations
with constant fitnesses. There are, however, many exceptions when
mating is nonrandom, when the fitnesses are not constant, or when there are
interactions between alleles of different genes (Ewens 1979; Curtsinger 1984).
Note particularly that $\bar{w}$ is the average fitness in the population, not the average
fitness of the population. The relative survivorships $w_{11}$, $w_{12}$, and $w_{22}$ are
relevant only to the differential mortality of the genotypes within a population
at any given time. The average of the relative survivorships is the average
"fitness" $\bar{w}$ in the population. However, $\bar{w}$ has no necessary relation to
vernacular meanings of "fitness" such as competitive ability, population size,
production of biomass, or evolutionary persistence (Haymer and Hartl 1982).
Although overdominance is one mechanism for the maintenance of polymorphism,
well-established examples are few.

The classic case is sickle-cell anemia in human beings, which is prevalent in
many populations at risk for the type of malaria caused by the mosquito-borne
protozoan parasite Plasmodium falciparum (Figure 6.5). The anemia is caused by
an allele S that codes for a variant form of the β chain of hemoglobin. In persons
of genotype SS, many red blood cells assume a curved, elongated shape
("sickling") and are removed from circulation. The result is a severe anemia as
well as pain and disability owing to the accumulation of defective cells in the
capillaries, joints, spleen, and other organs. In the absence of intensive medical
care, the anemia is often fatal. Nevertheless, the S allele is maintained
at a relatively high frequency because persons of genotype AS, in which A is the
nonmutant allele, have only a mild form of the anemia but are quite resistant to
malaria, perhaps because red blood cells infested with the parasite undergo
sickling and are removed from circulation. Homozygous AA people are not anemic
but, on the other hand, are the most sensitive to severe malaria. The result
of the offsetting sickle-cell anemia and malaria resistance is that the heterozygotes
have the highest fitness. In regions of Africa in which malaria is common,
the viabilities of AA, AS, and SS genotypes have been estimated as $w_{11} = 0.9$,
$w_{12} = 1$, and $w_{22} = 0.2$, respectively (Cavalli-Sforza and Bodmer 1971; Templeton
1982). Substitution into Equation 6.18 leads to a predicted equilibrium allele frequency
for A of $\hat{p} = (1 - 0.2)/(2 - 0.9 - 0.2) = 0.89$. Consequently, that of S is 0.11. This value is reasonably
close to the average allele frequency of 0.09 across West Africa, but there is considerable
variation in allele frequency among local populations.

Figure 6.5 The dark gray areas show the incidence of falciparum malaria in
Africa, the Middle East, and southern Europe before mosquito
control programs were implemented. The light gray areas are regions with a
high incidence of sickle-cell anemia. Note the extensive overlap in the distributions.

PROBLEM 6.6 Experimental populations of Drosophila pseudoobscura
were periodically treated with weak doses of the insecticide DDT.
One population was initially polymorphic for five different inversions
of the third chromosome. After 13 generations, three of the inversions
had essentially disappeared from the population. The two that
remained were Standard (ST) and Arrowhead (AR). Changes in frequency
of each inversion were monitored and, from the values for the
first 6 generations, the relative fitnesses of ST/ST, ST/AR, and
AR/AR genotypes were estimated as 0.47, 1.0, and 0.62, respectively
(DuMouchel and Anderson 1968). Because the inversions undergo
almost no recombination, each type can be considered as an "allele."
What equilibrium frequency of ST is predicted? What equilibrium
value of $\bar{w}$ is predicted?


ANSWER From Equation 6.18, $\hat{p}$ = (1.0 - 0.62)/(2.0 - 0.47 - 0.62) =
0.42. (The observed value after 13 generations was 0.43.) The predicted
equilibrium value of $\bar{w}$, from Equation 6.8, equals $0.42^2 \times 0.47 + 2
\times 0.42 \times 0.58 \times 1.0 + 0.58^2 \times 0.62 = 0.78$.

PROBLEM 6.7 Warfarin is a blood anticoagulant used for rat control
in World War II and afterward. Initially highly successful, the effectiveness
of the rodenticide gradually diminished owing to the evolution
of resistance among some target populations. Among Norway
rats in Great Britain, resistance results from an otherwise harmful
mutation R in a gene in which the normal nonresistant allele may be
denoted S. In the absence of warfarin, the relative fitnesses of SS, SR,
and RR genotypes have been estimated as 1.00, 0.77, and 0.46, respectively.
In the presence of warfarin, the relative fitnesses have been estimated
as 0.68, 1.00, and 0.37, respectively (May 1985). The reduced
fitness of the RR genotype appears to result from an excessive requirement
for vitamin K. Calculate the equilibrium frequency $\hat{q}$ of R in the
presence of warfarin. Noting that, in the absence of warfarin, R and S
are very nearly additive in their effects on fitness, estimate the
approximate number of generations required for the allele frequency
of R to decrease from $\hat{q}$ to 0.01 in the absence of the poison.

ANSWER From Equation 6.18, the equilibrium frequency $\hat{p}$ of S
equals (1.00 - 0.37)/(2 - 0.68 - 0.37) = 0.66, and so $\hat{q}$ of R = 0.34. Setting
$q_0 = 0.34$ and $q_t = 0.01$ in Equation 6.16, with s = 1.00 - 0.46 = 0.54,
yields t = 14.6 generations. (The approximation is very good even
though s is large; the exact value is 14 generations.)
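The arithmetic in Problems 6.6 and 6.7 can be retraced with a short Python sketch. It assumes that Equation 6.16, which is not reproduced in this section, is the additive-selection result ln(p_t/q_t) = ln(p_0/q_0) + (s/2)t; that form reproduces the 14.6 generations quoted in the answer. The function names are illustrative, and the exact count comes from iterating the standard viability-selection recursion one generation at a time.

```python
import math


def interior_equilibrium(w11, w12, w22):
    """Equation 6.18: interior equilibrium frequency of the first allele."""
    return (w12 - w22) / (2.0 * w12 - w11 - w22)


def mean_fitness(p, w11, w12, w22):
    """Equation 6.8: average fitness in the population."""
    q = 1.0 - p
    return p * p * w11 + 2.0 * p * q * w12 + q * q * w22


def generations_until(q0, q_target, w_ss, w_sr, w_rr):
    """Count generations of exact selection until the frequency of R is <= q_target."""
    q, n = q0, 0
    while q > q_target:
        p = 1.0 - q
        w_bar = mean_fitness(p, w_ss, w_sr, w_rr)
        q = q * (p * w_sr + q * w_rr) / w_bar  # frequency of R after selection
        n += 1
    return n


# Problem 6.6: ST/ST, ST/AR, AR/AR fitnesses.
p_st = interior_equilibrium(0.47, 1.0, 0.62)
w_bar_st = mean_fitness(p_st, 0.47, 1.0, 0.62)

# Problem 6.7, with warfarin: SS, SR, RR fitnesses.
p_s = interior_equilibrium(0.68, 1.00, 0.37)
q_r = 1.0 - p_s

# Problem 6.7, without warfarin: assumed additive approximation (Equation 6.16).
s = 1.00 - 0.46
q0, qt = 0.34, 0.01  # rounded starting value used in the printed answer
t_approx = (2.0 / s) * (math.log((1 - qt) / qt) - math.log((1 - q0) / q0))

# Exact recursion with the no-warfarin fitnesses 1.00, 0.77, 0.46.
t_exact = generations_until(q0, qt, 1.00, 0.77, 0.46)

print(round(p_st, 2), round(w_bar_st, 2), round(q_r, 2), round(t_approx, 1), t_exact)
```

The exact count depends on the convention used for the stopping generation, so it should be read as a check on the order of magnitude of the approximation rather than as a precise reproduction of the printed value.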

Local Stability 

Although the curves in Figure 6.4A indicate that the interior equilibrium is
locally stable when there is overdominance, an alternative approach is also
applicable to the analysis of local stability in models of much greater
complexity. It is based on the expression for $\Delta p$ in Equation 6.11. To emphasize
that $\Delta p$ is a function of p, we will write it as an explicit function, $\Delta(p)$. The
local stability of an equilibrium depends on the behavior of $\Delta(p)$ for a value of
p close to, but not equal to, the equilibrium, as illustrated in Figure 6.6. It is
convenient to write $\Delta(p + \epsilon)$ as the change in allele frequency when the starting
point is a small deviation, $\epsilon$, from any allele frequency p. The function
$\Delta(p + \epsilon)$ can be expanded term by term into an infinite sum:

$$\Delta(p + \epsilon) = \Delta(p) + \epsilon \frac{d\Delta(p)}{dp} + \frac{\epsilon^2}{2!} \frac{d^2\Delta(p)}{dp^2} + \frac{\epsilon^3}{3!} \frac{d^3\Delta(p)}{dp^3} + \cdots$$

The mathematical basis of this type of expansion is beyond the scope of
the book. If you are unfamiliar with it and want to look it up, you will find it
under the heading the Taylor series in most textbooks of calculus. It is named
after the mathematician Brook Taylor (1685-1731).

The value of the Taylor series expansion is that, when $\epsilon$ is sufficiently
small, then all terms in $\epsilon^2$ and higher can be ignored. Therefore, for any value


Figure 6.6 The change in allele frequency $\Delta p$ plotted as a function of allele
frequency p for a case of overdominance in which $w_{11} = 0.6$, $w_{12} = 1$, and
$w_{22} = 0.2$. Starting with an allele frequency $p_0$ smaller than the equilibrium
value, the positive value of $\Delta p_0$ indicates that the allele frequency in the next
generation, $p_1$, will be greater than $p_0$ because $p_1 = p_0 + \Delta p_0$. At an allele frequency
of $p_1$, the value of $\Delta p_1$ is also positive, and so $p_2$ is greater than $p_1$ because
$p_2 = p_1 + \Delta p_1$. The steady increase continues until the population arrives at the
equilibrium point $\hat{p}$. The same logic shows that, starting with an initial allele
frequency greater than $\hat{p}$, the allele frequency decreases in each succeeding generation
and ultimately converges to the equilibrium from the other side.

of p, we can approximate $\Delta(p + \epsilon)$ in terms of $\Delta(p)$ itself and its first derivative.
Furthermore, if p is one of the equilibrium points, then $\Delta(p) = 0$ by definition,
and so the sign of $\Delta(p + \epsilon)$ depends on the sign of the first derivative of $\Delta(p)$ evaluated
at the equilibrium in question. By definition, an equilibrium is locally
stable if the allele frequency, starting at a point near the equilibrium, moves
ever closer to the equilibrium. In symbols, this means that $\Delta(p + \epsilon) < 0$ if $\epsilon > 0$
and $\Delta(p + \epsilon) > 0$ if $\epsilon < 0$. Therefore, any equilibrium point, denoted generically
as $\hat{p}$, is locally stable if, and only if,

$$\left. \frac{d\Delta(p)}{dp} \right|_{\hat{p}} < 0$$

where the vertical line and $\hat{p}$ mean that the derivative should be evaluated at
the equilibrium in question.

In practice, calculating the derivative of $\Delta(p)$ can be quite tedious without
the use of computer software like Mathematica to do the algebraic manipulations.
The result of differentiating Equation 6.11 is that

$$\frac{d\Delta(p)}{dp} = \frac{pq\tilde{w}}{\bar{w}} + \frac{(q - p)(p - \hat{p})\tilde{w}}{\bar{w}} - \frac{2pq(p - \hat{p})^2 \tilde{w}^2}{\bar{w}^2}$$

where $\tilde{w} = w_{11} - 2w_{12} + w_{22}$ and $\hat{p}$ is the interior equilibrium given by
Equation 6.18. With overdominance, $\tilde{w} < 0$. Note that, when
$d\Delta(p)/dp$ is evaluated at p = 0 or p = 1, both the first and last terms equal 0;
when it is evaluated at p = $\hat{p}$, the second and last terms equal 0. The stability
analysis proceeds as follows:

• At p = 0, sign $[d\Delta(p)/dp]$ = -sign $(\tilde{w})$ > 0;

• At p = $\hat{p}$, sign $[d\Delta(p)/dp]$ = sign $(\tilde{w})$ < 0;

• At p = 1, sign $[d\Delta(p)/dp]$ = -sign $(\tilde{w})$ > 0.

Therefore, as is already clear from Figure 6.4A, the equilibrium points at
0, $\hat{p}$, and 1 are unstable, locally stable, and unstable, respectively. This stability
analysis is predicated on the assumption of heterozygote superiority,
which implies that $\tilde{w} < 0$. Exactly the same equilibrium points are present
when there is heterozygote inferiority, but then $\tilde{w} > 0$, which means that the
stability property of each equilibrium point is reversed. This situation is discussed
next.
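The sign test for local stability can be checked numerically. The Python sketch below estimates the derivative of Δ(p) by a central finite difference and compares it with the three-term expression obtained by differentiating Equation 6.11 (with w11 - 2 w12 + w22 written as `w_tilde`). The function names and the step size are illustrative assumptions; the overdominant fitnesses are those quoted for Figure 6.6 and the underdominant ones those quoted for Figure 6.7. The classification uses the criterion as stated in the text, namely a negative derivative at the equilibrium.

```python
def mean_fitness(p, w11, w12, w22):
    q = 1.0 - p
    return p * p * w11 + 2.0 * p * q * w12 + q * q * w22


def delta(p, w11, w12, w22):
    """Change in allele frequency in one generation (Equation 6.11)."""
    q = 1.0 - p
    return p * q * (p * (w11 - w12) + q * (w12 - w22)) / mean_fitness(p, w11, w12, w22)


def numeric_slope(p, w11, w12, w22, eps=1e-6):
    """Central-difference estimate of the derivative of delta at p."""
    up = delta(p + eps, w11, w12, w22)
    down = delta(p - eps, w11, w12, w22)
    return (up - down) / (2.0 * eps)


def analytic_slope(p, w11, w12, w22):
    """Three-term derivative of Equation 6.11."""
    q = 1.0 - p
    w_bar = mean_fitness(p, w11, w12, w22)
    w_tilde = w11 - 2.0 * w12 + w22
    p_hat = (w12 - w22) / (2.0 * w12 - w11 - w22)
    first = p * q * w_tilde / w_bar
    second = (q - p) * (p - p_hat) * w_tilde / w_bar
    third = 2.0 * p * q * (p - p_hat) ** 2 * w_tilde ** 2 / w_bar ** 2
    return first + second - third


def classify(w11, w12, w22):
    """Label each equilibrium by the sign of the derivative of delta."""
    p_hat = (w12 - w22) / (2.0 * w12 - w11 - w22)
    labels = {}
    for name, p in (("p = 0", 0.0), ("p = p_hat", p_hat), ("p = 1", 1.0)):
        slope = numeric_slope(p, w11, w12, w22)
        labels[name] = "locally stable" if slope < 0 else "unstable"
    return labels


overdominance = classify(0.6, 1.0, 0.2)   # fitnesses quoted for Figure 6.6
underdominance = classify(1.0, 0.8, 0.9)  # fitnesses quoted for Figure 6.7
print(overdominance)
print(underdominance)
```

Swapping the overdominant fitnesses for the underdominant ones reverses every label, in line with the sign change of w11 - 2 w12 + w22.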

Heterozygote Inferiority

Heterozygote inferiority means that the fitness of the heterozygous genotype
is smaller than that of both homozygotes: $w_{12} < w_{11}$ and $w_{12} < w_{22}$. An
interior equilibrium, given by Equation 6.18, exists in this case also. The
analysis in the previous section indicates that this equilibrium is unstable,
whereas the equilibria at p = 0 and p = 1 are both locally (but not globally) stable.
An example of heterozygote inferiority is depicted in Figure 6.7A, where
the arrows again denote the direction of change in allele frequency. If the
initial allele frequency is exactly equal to the equilibrium value (in this example,
$\hat{p} = 1/3$), then the allele frequency remains at that value. In all other cases,
p goes to 1 or 0 depending on whether the initial allele frequency was above
or below the equilibrium value.

Figure 6.7 Selection when there is heterozygote inferiority. (A) The allele
frequency goes to 0 or 1 depending on the initial frequency; the horizontal
axis is time (t, in generations). In this example,
$w_{11} = 1$, $w_{12} = 0.8$, and $w_{22} = 0.9$, and there is an unstable equilibrium when the
frequency of the A allele is $\hat{p} = 0.333$. An infinite population with p = 1/3 maintains
this frequency, but any slight upward change in the frequency of A results
in eventual fixation, and any slight downward change in the frequency of A
results in ultimate loss. (B) Average fitness $\bar{w}$ against p for the same example.
The unstable equilibrium represents the minimum of $\bar{w}$.

Figure 6.7B shows the change in average fitness. The unstable equilibrium
at $\hat{p} = 1/3$ is the minimum average fitness. The shape of the $\bar{w}$ curve has an
important implication that carries over to more complex examples. Imagine a
population with an allele frequency near 0, at which $\bar{w} = 0.9$. In terms of average
fitness in the population, the population would be better off if the allele
frequency were near 1, because then $\bar{w} = 1.0$. However, as shown by the direction
of the arrows, the population cannot evolve toward p = 1. It cannot get
through the "valley" because p = 0 is a locally stable equilibrium. The population
has no way to escape from the equilibrium even though, in doing so, it
would eventually end up with a greater average fitness. This consideration
would seem to limit the ability of natural selection to increase average fitness
in such cases, but one way out of the impasse is suggested in the next section.
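The threshold behavior and the fitness "valley" can be illustrated with a few lines of Python. This is only a sketch: the starting frequencies 0.30 and 0.36, the number of generations, and the grid used to locate the minimum are arbitrary choices, while the fitnesses are those quoted for Figure 6.7.

```python
W11, W12, W22 = 1.0, 0.8, 0.9  # heterozygote inferiority, as in Figure 6.7


def mean_fitness(p):
    """Average fitness in the population as a function of p."""
    q = 1.0 - p
    return p * p * W11 + 2.0 * p * q * W12 + q * q * W22


def next_p(p):
    """Frequency of A after one generation of viability selection."""
    q = 1.0 - p
    return p * (p * W11 + q * W12) / mean_fitness(p)


def iterate(p0, generations):
    p = p0
    for _ in range(generations):
        p = next_p(p)
    return p


p_hat = (W12 - W22) / (2.0 * W12 - W11 - W22)  # Equation 6.18

# Start slightly below and slightly above the unstable equilibrium.
p_from_below = iterate(0.30, 2000)
p_from_above = iterate(0.36, 2000)

# Locate the minimum of average fitness on a grid of allele frequencies.
grid = [i / 1000.0 for i in range(1001)]
p_at_minimum = min(grid, key=mean_fitness)

print(p_hat, p_from_below, p_from_above, p_at_minimum)
print(mean_fitness(0.0), mean_fitness(p_hat), mean_fitness(1.0))
```

A population that starts below the equilibrium ends at p = 0, where average fitness is 0.9, even though p = 1 would give an average fitness of 1.0.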

The Adaptive Topography and the Role of Random Genetic Drift 

Any graph of $\bar{w}$ against allele frequency is called an adaptive topography.
The simplest example is Figure 6.7B. In order to generalize the example, try
to imagine an adaptive topography in many dimensions with $\bar{w}$ a function
of the allele frequencies at many loci. In many dimensions, the adaptive
topography is a complex surface upon which there may be "peaks" and
"pits" and even "saddle-shaped" regions. The peaks represent locally stable
equilibria. Even if natural selection changes the allele frequencies so as to
move $\bar{w}$ to the top of some peak, the peak it perches on may not be the highest
peak that exists on the whole surface. However, as illustrated in Figure
6.7B, the population may become stuck there because the peak is a locally stable
equilibrium.

By what process can a population stranded on a submaximal fitness peak 
get off the peak? To do so, it has to travel through a nearby valley to a place 
where natural selection can carry it to the top of an even higher fitness peak. 
This is something that natural selection acting alone cannot accomplish because 
it entails a temporary reduction in fitness. There is, however, a process that can 
accomplish the task— random genetic drift. In a sufficiently small population, 
the allele frequencies can change by chance, even producing a reduction in aver- 
age fitness. Theoretically, random genetic drift can shift a population from a 
locally stable equilibrium, through a nearby valley, and into a region where it is 
attracted by another locally stable equilibrium toward a higher fitness peak. 
Random genetic drift can therefore play a crucial role in evolution by allowing a 
population to explore the full range of its adaptive topography. This role of ran- 
dom genetic drift has been particularly emphasized by Wright (1977 and earli- 
er) in his proposed shifting balance theory of evolution. Additional discussion 
of the theory is found in this chapter's section on interdemic selection; see also 
Hartl (1979), Provine (1986), and Coyne et al. (1997).
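The idea that drift can carry a small population across a fitness valley can be sketched with a simulation. The code below is an illustration under stated assumptions, not a model taken from the text: it assumes Wright-Fisher binomial sampling of 2N gametes each generation after viability selection, uses the heterozygote-inferiority fitnesses quoted for Figure 6.7, and picks N = 20, a starting frequency of 0.2, 200 replicates, and the random seed arbitrarily.

```python
import random

W11, W12, W22 = 1.0, 0.8, 0.9  # heterozygote inferiority, as in Figure 6.7


def after_selection(p):
    """Frequency of A after one generation of viability selection."""
    q = 1.0 - p
    w_bar = p * p * W11 + 2.0 * p * q * W12 + q * q * W22
    return p * (p * W11 + q * W12) / w_bar


def run_until_fixed(p0, n_diploid, rng):
    """Selection plus binomial sampling until A is lost (0.0) or fixed (1.0)."""
    copies = 2 * n_diploid
    p = p0
    while 0.0 < p < 1.0:
        p_selected = after_selection(p)
        count = sum(1 for _ in range(copies) if rng.random() < p_selected)
        p = count / copies
    return p


def deterministic(p0, generations):
    """The same recursion in an infinite population (no sampling)."""
    p = p0
    for _ in range(generations):
        p = after_selection(p)
    return p


rng = random.Random(1)  # fixed seed so that the run is repeatable
replicates = 200
fixed_for_A = sum(
    1 for _ in range(replicates) if run_until_fixed(0.2, 20, rng) == 1.0
)

print(deterministic(0.2, 2000))  # selection alone drives A toward loss
print(fixed_for_A, "of", replicates, "small populations fixed A")
```

With selection alone, a population starting at p = 0.2 loses A. With sampling in a small population, some replicates should nevertheless end at p = 1, the higher fitness peak.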


You may recall from Chapter 4 that outcrossing species typically contain a
large amount of hidden genetic variability in the form of recessive, or nearly
recessive, harmful alleles, each present at a low frequency. Now we can
explain why harmful alleles are not completely eliminated. Selection cannot
eliminate them because they are continually created anew through recurrent
mutation. To be specific, suppose that a is a harmful allele of the wildtype A
and that mutation of A to a takes place at the rate $\mu$ per generation. Because
the allele frequency of a, which we call q, remains small, reverse mutation of
a to A can safely be ignored. The calculation of p' carried out to obtain
Equation 6.10 is still valid, except that a proportion $\mu$ of A alleles mutate to a
in each generation. Therefore,

$$p' = \frac{p(pw_{11} + qw_{12})}{\bar{w}} (1 - \mu) \qquad (6.20)$$

To proceed further, it is convenient to write the relative fitnesses as

$$w_{11} = 1 \qquad w_{12} = 1 - hs \qquad w_{22} = 1 - s$$

The value of s is the selection coefficient against the homozygous aa genotypes
and h is the degree of dominance of the a allele. If h = 0, then a is a complete
recessive because AA and Aa have an identical fitness. If h = 1, then a is
dominant because Aa and aa have an identical fitness. Semidominance means
that h = 1/2. In mutation-selection balance, we are concerned with harmful
alleles that are near the recessive end of the spectrum, and so h will usually
be substantially smaller than 0.5.

Equilibrium Allele Frequencies 

When selection is balanced by recurrent mutation, there is a globally stable
equilibrium at an allele frequency of $\hat{p}$, which is the value of p in Equation
6.20 for which p' = p. The equilibrium frequency of the harmful a allele is
therefore $\hat{q} = 1 - \hat{p}$. There are two important cases.

• When the harmful allele is a complete recessive (h = 0), then

$$\hat{q} = \sqrt{\mu / s} \qquad (6.21)$$

• When the harmful allele shows partial dominance (h > 0), then, to an
excellent approximation for realistic values of $\mu$, h, and s,

$$\hat{q} = \frac{\mu}{hs} \qquad (6.22)$$

Use of these equations is exemplified by Huntington disease in human
beings. This severe inherited disorder is characterized by a degeneration of
the neuromuscular system that typically appears after age 35. Although the
disease itself results from a dominant mutation, the effects on fitness show
only partial dominance owing to the late age of onset of the disease. Relative
to a value of $w_{11} = 1$ for the homozygous nonmutant genotype, the fitness of
the heterozygous genotype has been estimated as $w_{12} = 0.81$ (Reed and Neel
1959). Homozygous mutant genotypes also have the disease, but they are so
rare that the equilibrium frequency of the mutant allele is determined by the
fitness of the heterozygote. Equation 6.22 with hs = 0.19 is appropriate in this
example. If we knew either $\mu$ or $\hat{q}$, we could estimate the other. In a Michigan
population, $\hat{q} = 5 \times 10^{-5}$ for the Huntington allele (Reed and Neel 1959).
Assuming that the population is in equilibrium, we can estimate $\mu$ from
Equation 6.22 as $\mu = \hat{q}hs = 5 \times 10^{-5} \times 0.19 = 9.5 \times 10^{-6}$. This use of Equation 6.22
illustrates one of the common indirect methods for the estimation of mutation
rates in human beings.
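The two equilibrium formulas can be compared with the recursion they approximate. The Python sketch below iterates Equation 6.20, read here as selection followed by one-way mutation from A to a, and compares the resulting frequency of a with Equations 6.21 and 6.22. The function names, the starting frequency, and the number of generations are assumptions made for illustration; the parameter values are those of Problem 6.8 (a lethal with a mutation rate of 5 x 10^-6) and of the Huntington example.

```python
import math


def next_q(q, mu, s, h):
    """One generation of selection, then A -> a mutation (Equation 6.20)."""
    p = 1.0 - q
    w11, w12, w22 = 1.0, 1.0 - h * s, 1.0 - s
    w_bar = p * p * w11 + 2.0 * p * q * w12 + q * q * w22
    p_next = p * (p * w11 + q * w12) / w_bar * (1.0 - mu)
    return 1.0 - p_next


def equilibrium_q(mu, s, h, q0=0.01, generations=50_000):
    """Iterate the recursion long enough to settle near the equilibrium."""
    q = q0
    for _ in range(generations):
        q = next_q(q, mu, s, h)
    return q


mu, s = 5e-6, 1.0  # values used in Problem 6.8

# Complete recessive (h = 0): compare with Equation 6.21.
q_recessive = equilibrium_q(mu, s, 0.0)
q_recessive_formula = math.sqrt(mu / s)

# Partial dominance (h = 0.025): compare with Equation 6.22.
h = 0.025
q_partial = equilibrium_q(mu, s, h)
q_partial_formula = mu / (h * s)

# Indirect estimate of the mutation rate for the Huntington allele.
q_huntington, hs_huntington = 5e-5, 0.19
mu_huntington = q_huntington * hs_huntington

print(q_recessive, q_recessive_formula)
print(q_partial, q_partial_formula)
print(mu_huntington)
```

The recessive case should agree with the square-root formula essentially exactly, whereas the partial-dominance formula is an approximation that ignores terms of order q squared.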

The degree of dominance of a harmful allele is a primary factor in determining
its equilibrium frequency. Harmful alleles held in mutation-selection
balance are rare. Thus the great majority of harmful alleles are present in
heterozygous genotypes. Because there are so many heterozygous genotypes,
relative to homozygous mutant genotypes, even a small reduction in
fitness in the heterozygote has a large effect in decreasing the equilibrium
allele frequency. This effect is shown quantitatively in Figure 6.8, which
depicts $\hat{q}$ as a function of $\mu/s$ and h. Note how the surface bends sharply
upward at the far-right corner where h = 0. The increase indicates that, for a
given value of $\mu/s$, a completely recessive allele is maintained at a higher
equilibrium frequency than a partially dominant allele. Furthermore, the
surface drops sharply as h increases from 0, which means that even a small
degree of dominance can cause a large reduction in equilibrium frequency.
In general, for realistic values of $\mu$, s, and h, the value of $\hat{q}$ is typically less
than 0.01. Therefore, although mutation-selection balance can account for
low-frequency deleterious alleles, it cannot readily account for a harmful
allele with a frequency greater than 0.01.

Figure 6.8 Allele frequencies maintained at equilibrium by mutation-selection
balance. At each point on the surface, the height is the equilibrium frequency
$\hat{q}$ of a harmful allele, given as a function of the mutation rate $\mu$ (expressed in
multiples of the selection coefficient s) and the degree of dominance h. Note that
the surface bends sharply upward toward h = 0, a characteristic that means that
even a small degree of dominance results in a substantial decrease in the equilibrium
frequency of the harmful allele. The $\mu/s$ axis is easiest to interpret when
the harmful allele is a lethal (s = 1).

PROBLEM 6.8 To confirm for yourself that a small amount of dominance
can have a major effect in reducing the equilibrium frequency
of a harmful allele, imagine an allele that is lethal when homozygous
(s = 1) in a population of Drosophila. Suppose that the allele is maintained
by mutation-selection balance with $\mu = 5 \times 10^{-6}$. Calculate the
equilibrium frequency of the allele for a complete recessive and for
partial dominance when h = 0.025.

ANSWER For a complete recessive, $\hat{q} = \sqrt{\mu/s} = \sqrt{5 \times 10^{-6}} = 2.24 \times 10^{-3}$.
For partial dominance, $\hat{q} = \mu/hs = (5 \times 10^{-6})/0.025 = 2.00 \times 10^{-4}$. With
partial dominance, the equilibrium allele frequency is reduced more
than tenfold, and the frequency of homozygous recessive genotypes
at equilibrium is reduced more than a hundredfold. It is of interest
that h = 0.025 is near the average degree of dominance estimated for
"recessive" lethals in Drosophila (Simmons and Crow 1977).

The Haldane-Muller Principle

The Haldane-Muller principle, named after the geneticists J. B. S. Haldane
(1892-1964) and H. J. Muller (1890-1967), deals with the effect of mutation-selection
balance on the average fitness of a population. Ignoring recurrent
mutation, selection would be able to rid a population completely of a harmful
allele. Then, $\hat{q} = 0$, and $\bar{w} = 1$. Because of recurrent mutation, the equilibrium
frequency is greater than 0. When h = 0, the average fitness in the
population at equilibrium equals $1 - \hat{q}^2 s = 1 - (\mu/s)s = 1 - \mu$. The reduction in
average fitness due to mutation therefore equals $1 - (1 - \mu) = \mu$, which is
called the mutation load. When a is partially dominant, the mutation load is
approximately $2\mu$ because the average fitness at equilibrium is $1 - 2\hat{p}\hat{q}hs - \hat{q}^2 s
\approx 1 - 2\mu$. This result is obtained by ignoring terms in $\hat{q}^2$ because they are so
small. With or without partial dominance, therefore, the effect of recurrent
mutation in reducing the average fitness in the population is independent
of how harmful the mutation is. That the effect of recurrent mutation on
average population fitness depends only on the mutation rate is the Haldane-Muller
principle. The implication is that the harmful effect of an increase in
the mutation rate is the same irrespective of whether the mutations produced
are mildly detrimental or severely harmful. The effects of severe and mild
mutations balance out because a more harmful mutation comes to a lower
equilibrium frequency.
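A numerical illustration of the Haldane-Muller principle follows. This Python sketch iterates the mutation-selection recursion (Equation 6.20) to its equilibrium for several hypothetical combinations of s and h and reports the mutation load, 1 minus the average fitness, as a multiple of the mutation rate. The mutation rate of 10^-5, the (s, h) pairs, the starting frequency, and the number of generations are all arbitrary choices made for the example.

```python
def mean_fitness(q, s, h):
    """Average fitness with w11 = 1, w12 = 1 - hs, w22 = 1 - s."""
    p = 1.0 - q
    return p * p + 2.0 * p * q * (1.0 - h * s) + q * q * (1.0 - s)


def next_q(q, mu, s, h):
    """Selection followed by A -> a mutation (Equation 6.20)."""
    p = 1.0 - q
    p_next = p * (p + q * (1.0 - h * s)) / mean_fitness(q, s, h) * (1.0 - mu)
    return 1.0 - p_next


def mutation_load(mu, s, h, q0=0.01, generations=50_000):
    """Load at the mutation-selection equilibrium reached by iteration."""
    q = q0
    for _ in range(generations):
        q = next_q(q, mu, s, h)
    return 1.0 - mean_fitness(q, s, h)


MU = 1e-5  # hypothetical mutation rate

# Complete recessives of very different severity.
load_recessive_lethal = mutation_load(MU, s=1.0, h=0.0)
load_recessive_mild = mutation_load(MU, s=0.2, h=0.0)

# Partially dominant alleles of very different severity.
load_partial_lethal = mutation_load(MU, s=1.0, h=0.05)
load_partial_mild = mutation_load(MU, s=0.2, h=0.25)

for load in (load_recessive_lethal, load_recessive_mild,
             load_partial_lethal, load_partial_mild):
    print(load / MU)
```

The printed ratios should be close to 1 for the recessive cases and close to 2 for the partially dominant cases, whatever the value of s.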


Although the two-allele model of viability selection illustrates the possible 
outcomes of selection, it ignores many potential complications. For example, 
when the genotypes differ in fertility rather than survivorship, then the 
model of viability selection is inadequate except in special cases. Most muta- 
tions have pleiotropic effects; that is, they affect more than one phenotypic 
attribute of the organism. For example, a gene affecting embryonic growth
rate may also affect age at first reproduction. When the pleiotropic effects act 
in opposing directions (for example, increasing viability but reducing fertili- 
ty), the net effect on fitness may be quite small. As a result, mutations with 
offsetting effects on different components of fitness may remain segregating 
in a population for many generations. 

Additional complications arise because fitness is determined by many 
genes that interact with each other. Simple models of selection are valid only 
when the alleles interact in such a way that their effects on fitness are addi- 
tive or multiplicative across genes. Other complications result when the fit- 
nesses of the genotypes are not constant but variable in time or space. In this 
section we briefly examine a sample of more complex models. Many of the 
models are of interest because they can maintain genetic polymorphisms. 
Although the list is extensive, it is by no means complete. You should not try 
to memorize all the different types of selection. They are collected here only 
for ease of reference. 

Frequency-Dependent Selection 

Frequency-dependent selection takes place when fitness is a function of 
either allele frequencies or genotype frequencies. There is no restriction on 
the type of frequency dependence except that each Darwinian fitness must 
be nonnegative. A simple example that illustrates frequency dependence is 
one in which the fitness of each genotype decreases in proportion to its fre- 
quency with a constant of proportionality equal to c: 

AA: $w_{11} = 1 - cp^2$     Aa: $w_{12} = 1 - 2cpq$     aa: $w_{22} = 1 - cq^2$

In this example, $\Delta p = cpq(q - p)(p^2 - pq + q^2)/\bar{w}$, and so there are equilibria
at p = 0, 1/2, and 1. (The factor $p^2 - pq + q^2$ does not have a root for p in the
range [0, 1].) A curious feature of this type of frequency-dependent selection
is that, at equilibrium, $w_{12}$ is smaller than either $w_{11}$ or $w_{22}$, so there is heterozygote
inferiority; yet p = 1/2 is a globally stable equilibrium and $\bar{w}$ is a
maximum at this equilibrium. The peculiarities of this example are illustrative
of frequency-dependent selection in general. Because the fitnesses can be
any functions of allele or genotype frequency, nearly anything can happen.
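The behavior of this frequency-dependent example can be checked numerically. In the Python sketch below, the fitnesses are recomputed from the current allele frequency in every generation and then used in the ordinary selection recursion. The value c = 0.5, the starting frequencies, and the function names are illustrative assumptions.

```python
C = 0.5  # hypothetical constant of proportionality


def fitnesses(p):
    """Genotype fitnesses that decline with each genotype's own frequency."""
    q = 1.0 - p
    return 1.0 - C * p * p, 1.0 - 2.0 * C * p * q, 1.0 - C * q * q


def mean_fitness(p):
    q = 1.0 - p
    w11, w12, w22 = fitnesses(p)
    return p * p * w11 + 2.0 * p * q * w12 + q * q * w22


def next_p(p):
    """One generation of selection with the current, frequency-dependent fitnesses."""
    q = 1.0 - p
    w11, w12, _ = fitnesses(p)
    return p * (p * w11 + q * w12) / mean_fitness(p)


def iterate(p0, generations):
    p = p0
    for _ in range(generations):
        p = next_p(p)
    return p


p_from_low = iterate(0.05, 1000)
p_from_high = iterate(0.90, 1000)

# Fitnesses at the interior equilibrium, and the location of maximum average fitness.
w11_eq, w12_eq, w22_eq = fitnesses(0.5)
grid = [i / 1000.0 for i in range(1001)]
p_at_maximum = max(grid, key=mean_fitness)

print(p_from_low, p_from_high)
print(w11_eq, w12_eq, w22_eq, p_at_maximum)
```

Both trajectories should approach p = 1/2, where the heterozygote has the lowest fitness of the three genotypes and average fitness is nonetheless at its maximum.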

Density-Dependent Selection 

Density-dependent selection means that the fitnesses are functions of the
population size. Models of density-dependent selection must explicitly
include population size and population growth. With logistic growth of two
haploid genotypes whose numbers at time t are A(t) and B(t), Equation 1.11
in Chapter 1 becomes

$$\frac{dA(t)}{dt} = r_1 A(t) \left( 1 - \frac{A(t) + B(t)}{K_1} \right) \qquad \frac{dB(t)}{dt} = r_2 B(t) \left( 1 - \frac{A(t) + B(t)}{K_2} \right)$$

Each genotype has its own intrinsic rate of increase ($r_1$ or $r_2$) and its own
carrying capacity ($K_1$ or $K_2$), but they affect each other's growth through the
total population size A(t) + B(t). At any time, the outcome of selection
depends on the total population size. When the population size is much
smaller than either $K_1$ or $K_2$, then the right-hand factor in each growth equation
equals approximately 1, and so the selection is determined by the relative
values of $r_1$ and $r_2$. When the population size becomes approximately
equal to the smaller of $K_1$ or $K_2$, then the genotype with the smaller carrying
capacity stops growing while the other continues, and so the selection is
determined by the relative values of $K_1$ and $K_2$. Interesting events happen
when the selection for r favors one genotype and the selection for K favors
the other, especially in situations in which stochastic factors also affect population
size or there is a time lag between population size and its effect on
growth rate. For further information on these types of models, see Roughgarden
(1979), May (1981), Bulmer (1994), and Cohen (1995).
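The switch from selection on r to selection on K can be seen in a small numerical integration of the two growth equations. The sketch below uses a simple forward Euler scheme, which is an assumption made here for brevity, and hypothetical parameter values in which genotype A has the higher intrinsic rate of increase and genotype B has the higher carrying capacity.

```python
def simulate(a0, b0, r1, k1, r2, k2, dt, steps):
    """Forward-Euler integration of the two logistic growth equations."""
    a, b = a0, b0
    history = [(0.0, a, b)]
    for i in range(1, steps + 1):
        total = a + b
        da = r1 * a * (1.0 - total / k1)  # growth of genotype A
        db = r2 * b * (1.0 - total / k2)  # growth of genotype B
        a += da * dt
        b += db * dt
        history.append((i * dt, a, b))
    return history


# Hypothetical values: A grows faster (r1 > r2), B tolerates crowding better (K2 > K1).
history = simulate(a0=1.0, b0=1.0, r1=1.0, k1=100.0, r2=0.5, k2=150.0,
                   dt=0.01, steps=20_000)

time_early, a_early, b_early = history[500]  # state at t = 5
time_late, a_late, b_late = history[-1]      # state at t = 200

print(time_early, a_early, b_early)
print(time_late, a_late, b_late)
```

Early in the run, when the population is sparse, genotype A outnumbers B. Once the total size exceeds the smaller carrying capacity, A declines and B takes over.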

Fecundity Selection 

In fecundity selection, differences in fitness between the genotypes result 
from the differing abilities of mating pairs to produce offspring. Because both 
genotypes in a mating pair contribute to the total number of offspring, the 
number of fitness parameters potentially equals the number of distinct kinds 
of mating pairs. For two alleles of one gene, there are nine possible types of 
mating because reciprocal matings may differ in the expected number of off- 
spring; for example, the expected number of offspring from the mating 
Aa ♀ × aa ♂ may differ from that from the mating Aa ♂ × aa ♀. The presence
of so many fitness parameters complicates the mathematical analysis. An
analysis of selection based on individual genotypes, analogous to viability
differences, is not possible unless the overall fecundity of any mating pair can
be written as either the product or the sum of two parameters, one for each
genotype in the mating pair. When this strong simplification does not hold,
models of selection with fertility differences become rather complex (Ewens
1979; Clark and Feldman 1986). Models in which differences in fecundity are
combined with differences in survivorship can retain genetic polymorphisms
even if there is directional selection in one or the other component of fitness.
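The count of nine mating types, and the simplification obtained when fecundity factors into one parameter per parent, can be made concrete with a short Python sketch. The female and male effects below are hypothetical numbers chosen only for illustration.

```python
from itertools import product

GENOTYPES = ("AA", "Aa", "aa")

# Ordered (female, male) pairs: reciprocal matings are counted separately.
ordered_matings = list(product(GENOTYPES, repeat=2))

# Hypothetical per-genotype effects on fecundity for each sex.
female_effect = {"AA": 1.0, "Aa": 0.9, "aa": 0.7}
male_effect = {"AA": 1.0, "Aa": 1.0, "aa": 0.8}

# Multiplicative model: nine pair fecundities generated from six parameters.
fecundity = {
    (female, male): female_effect[female] * male_effect[male]
    for female, male in ordered_matings
}

print(len(ordered_matings))
print(fecundity[("Aa", "aa")], fecundity[("aa", "Aa")])
```

In the general case each of the nine pairs would need its own parameter. The multiplicative form is the special case in which selection can be analyzed genotype by genotype.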

Age-Structured Populations 

Age-structured populations with overlapping generations present problems
even more formidable than those caused by fecundity and survivorship differences
in populations with discrete, nonoverlapping generations. In each
short interval of time, a new cohort of newborns comes into existence and,
as it ages, the fate of each organism in the cohort is governed by the functions
l(x), which is the probability of survival from birth to age x, and b(x), which is
the probability that an organism of age x (actually in the infinitesimal age
interval x to x + dx) reproduces. If the functions l(x) and b(x) maintain the
same form over time, then it can be shown that the population eventually
reaches a stable age distribution in which the number of organisms in each
age group increases or decreases at a constant rate. At the stable age distribution,
the overall growth rate of the population is the value of m that satisfies
the equation:

$$1 = \int_0^\infty e^{-mx} l(x) b(x) \, dx$$

(See Crow and Kimura, 1970, for a derivation.) For this value of m,
dN/dt = mN, where N is the total population size. In an age-structured population,
m corresponds to the intrinsic rate of increase denoted r in Equation
1.7 in Chapter 1.
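The growth rate m defined by the integral equation usually has to be found numerically. The Python sketch below does so by bisection on a trapezoid-rule approximation of the integral. The survival and birth schedules are hypothetical: a constant death rate and a constant birth rate at all ages, chosen because in that special case the integral can be done by hand and gives m equal to the birth rate minus the death rate, which serves as a check. The truncation age, the step size, and the bracketing interval are also assumptions.

```python
import math

DEATH_RATE = 0.2  # hypothetical constant death rate
BIRTH_RATE = 0.5  # hypothetical constant birth rate


def survival(x):
    """l(x): probability of surviving from birth to age x."""
    return math.exp(-DEATH_RATE * x)


def birth(x):
    """b(x): rate of reproduction at age x."""
    return BIRTH_RATE


def lotka_integral(m, l, b, x_max=100.0, dx=0.02):
    """Trapezoid-rule estimate of the integral of exp(-m x) l(x) b(x) on [0, x_max]."""
    n = int(round(x_max / dx))
    total = 0.0
    for i in range(n + 1):
        x = i * dx
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * math.exp(-m * x) * l(x) * b(x)
    return total * dx


def solve_growth_rate(l, b, low=-0.1, high=2.0, iterations=50):
    """Bisection for the m at which the integral equals 1 (it decreases as m grows)."""
    for _ in range(iterations):
        middle = 0.5 * (low + high)
        if lotka_integral(middle, l, b) > 1.0:
            low = middle
        else:
            high = middle
    return 0.5 * (low + high)


m = solve_growth_rate(survival, birth)
print(m, BIRTH_RATE - DEATH_RATE)
```

For more realistic schedules, such as a delayed age at first reproduction, only the two functions need to be replaced, although the bracketing interval may then need adjusting.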

So far so good, but genetics complicates this situation enormously. If the
l(x) and b(x) functions differ for different genotypes, then the allele frequencies
change through time. As the allele frequencies change, so does the age
structure, and the genotype frequencies in each age class may be different.
The result is that the age structure may not become stable until selection
reaches some equilibrium (possibly fixation). The sorts of complexities that
can arise have been examined by Charlesworth (1980).

Heterogeneous Environments and Clines 

Heterogeneous environments refer to models in which the relative fitnesses 
change according to the environment. The environmental heterogeneity may 
be spatial or temporal or both. Selection of this type can maintain polymor- 
phisms in the absence of overdominance. If each homozygous genotype is 
favored in a different subset of environments, then there can be marginal
overdominance, in which the heterozygous genotype has the highest fitness
when averaged across all the environments, even though it is not the most fit
genotype in any particular environment.
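Marginal overdominance is easy to see in a small numerical example. The fitness values below are hypothetical, and the sketch uses a simple unweighted arithmetic average over two environments, which is an assumption. The kind of average that matters in a particular model depends on how the environments are encountered.

```python
# Hypothetical relative fitnesses of the three genotypes in two environments.
fitness_by_environment = {
    "environment 1": {"AA": 1.0, "Aa": 0.9, "aa": 0.6},
    "environment 2": {"AA": 0.6, "Aa": 0.9, "aa": 1.0},
}

genotypes = ("AA", "Aa", "aa")
n_environments = len(fitness_by_environment)

# Unweighted arithmetic average of each genotype's fitness across environments.
average_fitness = {
    g: sum(env[g] for env in fitness_by_environment.values()) / n_environments
    for g in genotypes
}

# The genotype with the highest fitness within each environment.
best_in_each = {
    name: max(genotypes, key=lambda g: env[g])
    for name, env in fitness_by_environment.items()
}

print(average_fitness)
print(best_in_each)
```

Here the heterozygote is never the fittest genotype within an environment, yet its average fitness exceeds that of either homozygote.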

In some cases, the relative fitnesses of the genotypes vary geographically
across a more or less smooth environmental gradient, for example, according
to latitude, altitude, aridity, or salinity. If sufficiently stable in time, a gradient
of selection across a region can result in a gradient of allele frequency across
the region. A geographical trend in an allele frequency is called a cline. An
unusually extreme example of a cline is found in the hemoglobin-$I^1$ allele in
the eelpout fish Zoarces viviparus, the allele frequency of which drops from a
value of nearly 1 in the North Sea to a value of nearly 0 in the Baltic Sea
(Christiansen and Frydenberg 1974). In human aboriginal populations, there
is a cline of increasing frequency of the allele $I^B$ in the ABO blood groups
from Southwest to Northeast Europe.

Although clines can result from selection (for example, when one genotype
is favored at one extreme of the environmental gradient but disfavored
at the other extreme), clines can also result from other processes. Migration is
one possibility: differences in allele frequency in local populations at the
extremes of the range may result from chance processes (for example, different
founding populations), and migration of organisms from the extremes
into the intermediate zone produces the cline.

The strongest evidence that a cline results from selection is when a cline is
reproduced in different locations along a similar environmental gradient. An
example of parallel clines played out on a grand scale is found in the
electrophoretic polymorphism of alcohol dehydrogenase (the Adh gene) in D.
melanogaster. In Eastern North America, the frequency of the $Adh^F$ allele
increases as one goes north, whereas DNA polymorphisms flanking Adh
show no such geographic trend (Berry and Kreitman 1993). The cline is
shown in the upper part of Figure 6.9. The frequency of $Adh^F$ is correlated
with cooler temperatures and less rainfall in the more northern latitudes. In
Australia, as shown in the lower part of Figure 6.9, the frequency of the $Adh^F$
allele increases as one goes south (Oakeshott et al. 1982). This pattern is in
apparent contradiction to that in Eastern North America but, because
Australia is in the Southern Hemisphere, the clines are actually parallel. Both
show an increase in the frequency of $Adh^F$ as one proceeds from the equator
toward the polar cap (the North Pole in the Northern Hemisphere and the
South Pole in the Southern Hemisphere). On a much smaller geographical
scale, in mountainous regions, the frequency of the $Adh^F$ allele shows a clinal
increase with altitude, which is again correlated with cooler temperature and
less rainfall. Data from the Caucasus Mountains (Grossman et al. 1970) have
been discussed in Problem 4.2; parallel clines have also been studied in the
mountains of Mexico (Pipkin et al. 1976).

[Figure 6.9. Panel title: Eastern North America. Axis: Frequency of $Adh^F$ (arcsin $\sqrt{p}$).]

Figure 6.9 Parallel clines of the $Adh^F$ (alcohol dehydrogenase fast) allele in
Eastern North America and in Australia. The allele frequency is given as
arcsin($\sqrt{p}$), where p is the allele frequency of $Adh^F$. The angular
transformation stretches the scale near the extreme values of p: for values of
p = 0.1, 0.5, and 0.9, the values of arcsin($\sqrt{p}$) are 0.322, 0.785, and 1.249,
respectively, where the angles are measured in radians. The angular
transformation is often used for proportions because it separates the variance
of an estimate from the estimate itself: for a binomial proportion p based on n
observations, the variance of p is p(1 - p)/n, whereas the variance of
arcsin($\sqrt{p}$), with the angle expressed in radians, is approximately 1/(4n).
(North American data from Berry and Kreitman 1993; Australian data from
Oakeshott et al. 1982.)
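
The angular transformation in the caption can be checked numerically. The following is a minimal sketch; the function names `angular`, `binomial_variance`, and `angular_variance` are hypothetical, and the sample size of 100 is an arbitrary choice for illustration.

```python
import math

def angular(p):
    """Angular (arcsine square-root) transformation of a proportion, in radians."""
    return math.asin(math.sqrt(p))

def binomial_variance(p, n):
    """Variance of a binomial proportion estimated from n observations."""
    return p * (1.0 - p) / n

def angular_variance(n):
    """Approximate variance of arcsin(sqrt(p)), which does not depend on p."""
    return 1.0 / (4.0 * n)

n = 100  # arbitrary sample size, chosen only for illustration
for p in (0.1, 0.5, 0.9):
    print(p, round(angular(p), 3), binomial_variance(p, n), angular_variance(n))
```

The variance on the raw scale changes with p, whereas the approximate variance on the transformed scale depends only on n.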

Diversifying Selection 

The term diversifying selection refers narrowly to selection that favors
extreme phenotypes. In a normal distribution of phenotypes, for example,
diversifying selection means that organisms in the tails of the distribution are
favored relative to those in the middle. More generally, diversifying selection
refers to any type of selection in which genotypes are favored merely because
they are different. Genes under diversifying selection tend to maintain a
relatively large number of alleles. Examples include genes of the major
histocompatibility complex in mammals, in which the selective agent is
thought to act through resistance to parasitic microorganisms (Satta et al.
1993), and bacterial genes that produce toxins (colicins) that kill other
bacteria, in which the selective agent is the destruction of competitors (Riley
1993; Ayala et al. 1994).

Some plants have genes for gametophytic self-incompatibility, in which
a pollen grain that carries any self-incompatibility allele is unable to pollinate
a plant that carries the same allele. Self-incompatibility of this type implies
that no plant can fertilize itself. Because a plant of genotype $S_iS_j$ can produce
only $S_i$ and $S_j$ pollen, the pollen cannot fertilize $S_iS_j$ plants. Furthermore,
homozygous genotypes are not normally found because their formation
would require that $S_i$ pollen fertilize an $S_iS_j$ plant. It is easy to show that there
is positive selection for new self-sterility alleles and that, at equilibrium,
every allele has the same frequency. For n alleles, if $S_i$ has frequency $p_i$, then
the frequency of $S_iS_j$ genotypes (those carrying $S_i$) with random mating is
$2p_i(1 - p_i)/(1 - \sum p_i^2)$. The denominator is necessary because of the
absence of homozygous genotypes. The probability that an $S_i$ pollen can be
successful in fertilization is therefore the probability of genotypes other than
$S_iS_j$, which equals $1 - 2p_i(1 - p_i)/(1 - \sum p_i^2)$. At equilibrium, we must
have $p_i(1 - p_i) = p_j(1 - p_j)$. From these expressions follow some important
conclusions summarized in Problem 6.10. For more information on
gametophytic self-incompatibility systems, see Ioerger et al. (1991) and
Uyenoyama (1995).

PROBLEM 6.9 Show that $p_i(1 - p_i) = p_j(1 - p_j)$ for all i and j implies that
$p_i = p_j = 1/n$, where n is the number of self-incompatibility alleles and
$n \geq 3$. Use these equilibrium allele frequencies to show that the probability
that a pollen grain lands on a compatible style equals (n - 2)/n. Finally,
show that the probability of successful fertilization by a new mutant S
allele, relative to that of any preexisting allele, equals n/(n - 2).

ANSWER $p_i(1 - p_i) = p_j(1 - p_j)$ implies that
$p_i - p_j = p_i^2 - p_j^2 = (p_i - p_j)(p_i + p_j)$, so that either $p_i = p_j$ for all
i and j or $p_i + p_j = 1$. Because $n \geq 3$, $(p_i + p_j) \neq 1$. Because there are
n alleles, we must have $\sum p_i = 1$, and so $p_i = 1/n$. The probability of a
pollen grain landing on a compatible style is
$1 - 2p_i(1 - p_i)/(1 - \sum p_i^2) = 1 - 2/n = (n - 2)/n$. A pollen grain
containing a newly arising S allele will always land on a compatible style,
and so its probability of fertilization, relative to that of a preexisting
allele, equals 1/[(n - 2)/n] = n/(n - 2). In effect, this is the relative
fitness of a new mutation. For n = 3, 4, 5, 10, 50, and 100, it equals 3, 2,
1.67, 1.25, 1.04, and 1.02, respectively.
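
The arithmetic in the answer can be reproduced directly from the expression for the compatible fraction. This is a minimal sketch; the function names `compatible_fraction` and `new_allele_advantage` are hypothetical, and equal allele frequencies of 1/n are assumed as in the problem.

```python
def compatible_fraction(freqs, i):
    """Probability that pollen carrying allele i lands on a compatible style:
    1 - 2 p_i (1 - p_i) / (1 - sum of p_j squared)."""
    p = freqs[i]
    sum_sq = sum(q * q for q in freqs)
    return 1.0 - 2.0 * p * (1.0 - p) / (1.0 - sum_sq)

def new_allele_advantage(n):
    """Relative fertilization success of a new S allele when n resident
    alleles are each at frequency 1/n (a new allele is always compatible)."""
    freqs = [1.0 / n] * n
    return 1.0 / compatible_fraction(freqs, 0)

for n in (3, 4, 5, 10, 50, 100):
    print(n, round(new_allele_advantage(n), 2))
```

The advantage of a new allele shrinks toward 1 as the number of resident alleles grows.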

Differential Selection in the Sexes

Some genes may have different effects in the two sexes. If the fitnesses of
genotypes differ between the sexes, then genotypes that are disfavored in one
sex may be favored in the other. The offsetting effects increase the
opportunity for a balanced polymorphism. The survivorship model of selection
can be extended to include this case by supposing that the relative viabilities
of the genotypes AA, Aa, and aa are given by $w_{11}$, $w_{12}$, and $w_{22}$ in
females and by $v_{11}$, $v_{12}$, and $v_{22}$ in males. One of the w's and one of
the v's can be set arbitrarily to 1, which leaves four fitness parameters rather
than two. A more serious complication is that the allele frequencies in gametes
are no longer the same in males and females. Letting $p_f$ and $p_m$ be the
allele frequency of A in female and male gametes, respectively, then the
genotype frequencies of AA, Aa, and aa in the zygotes are $p_f p_m$,
$p_f q_m + q_f p_m$, and $q_f q_m$, respectively, where $q_f = 1 - p_f$ and
$q_m = 1 - p_m$. One of the consequences of differential selection in the sexes
is that, with an appropriate choice of fitnesses, it is possible to have more than
one stable polymorphic equilibrium. A stable equilibrium is also possible
with heterozygote inferiority in one sex or with incomplete dominance when
selection works in opposite directions in the two sexes.
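
One generation of this model can be sketched as follows. The code assumes that zygote genotype frequencies are the same in both sexes before selection and that heterozygotes segregate in Mendelian ratios; the function name `next_gamete_freqs` and the example viabilities are hypothetical.

```python
def next_gamete_freqs(p_f, p_m, w, v):
    """One generation of viability selection with sex-specific fitnesses.

    p_f, p_m: frequency of A in female and male gametes.
    w, v: viabilities of (AA, Aa, aa) in females and in males.
    Returns the frequency of A in the female and male gametes produced
    by the surviving adults.
    """
    q_f, q_m = 1.0 - p_f, 1.0 - p_m
    # Zygote frequencies of AA, Aa, aa (assumed identical in the two sexes).
    zygotes = (p_f * p_m, p_f * q_m + q_f * p_m, q_f * q_m)

    def allele_freq_after_selection(viabilities):
        aa_surv, ab_surv, bb_surv = (z * f for z, f in zip(zygotes, viabilities))
        total = aa_surv + ab_surv + bb_surv
        # Heterozygous survivors contribute half of their gametes as A.
        return (aa_surv + 0.5 * ab_surv) / total

    return allele_freq_after_selection(w), allele_freq_after_selection(v)

# Illustrative viabilities only: A favored in females, a favored in males.
p_f, p_m = 0.2, 0.6
for _ in range(5):
    p_f, p_m = next_gamete_freqs(p_f, p_m, w=(1.0, 0.9, 0.8), v=(0.8, 0.9, 1.0))
    print(round(p_f, 4), round(p_m, 4))
```

Without selection, a single generation brings both gametic frequencies to the average of the previous female and male values, which is a convenient sanity check on the bookkeeping.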

X-linked Genes

Genes located in the X chromosome can have the same sort of complications 
as differential selection in the sexes, but the possibilities for polymorphism 
are not quite so numerous because there are only three fitness parameters 
instead of four. If A and a are alleles of an X-linked gene, then there are three 
genotypes in females (AA, Aa, and aa) and two genotypes in males (either A 
or a along with the Y chromosome). One fitness parameter in each sex can be
set arbitrarily to 1. As with differential selection in the sexes, the allele fre- 
quencies differ in eggs and sperm. However, in any generation, the frequen- 
cy of A in male zygotes equals the frequency of A in female gametes of the 
preceding generation. If you do not understand why, think about the parental 
origin of the X chromosome in a male. 

Gametic Selection 

Many plants go through a life cycle in which both haploid products of
meiosis and the diploid products of fertilization are exposed to selection. In
mosses and vascular plants, for example, a diploid organism (the sporophyte)
produces spores each of which germinates to form a haploid organism (the
gametophyte) that reproduces asexually by mitosis. The gametophytes give
rise to haploid male and female gametes, which undergo fertilization, creating
a new diploid generation. In mosses, the prominent stage of the life cycle
is the gametophyte whereas, in higher plants, the prominent stage is the
sporophyte.

When the haploid phase of the life cycle is exposed to selection, the
selection is called gametic selection. As a concrete model, suppose that the
relative survivorships of A and a gametophytes (the haploid phase) are given by
$v_1$ and $v_2$, respectively. In the sporophytes (the diploid phase), the
survivorships can be written as before as $w_{11}$, $w_{12}$, and $w_{22}$. If p and q
are the allele frequencies of A and a at the beginning of the haploid phase, then
after the differential haploid mortality has taken place, the frequencies will be
$p^* = pv_1/\bar{v}$ and $q^* = qv_2/\bar{v}$, where $\bar{v} = pv_1 + qv_2$. With random
fertilization among the gametes, the diploid genotypes AA, Aa, and aa are
formed in the proportions $p^{*2}$, $2p^*q^*$, and $q^{*2}$, and these survive in the
relative proportions $w_{11}$, $w_{12}$, and $w_{22}$. You may verify for yourself that,
at the beginning of the haploid phase of the next generation, the allele
frequency of A is

$$p' = \frac{p^2 w_{11} v_1^2 + pq\, w_{12} v_1 v_2}{p^2 w_{11} v_1^2 + 2pq\, w_{12} v_1 v_2 + q^2 w_{22} v_2^2}$$

This equation has the same form as the equation for p' in Equation 6.10
except that $w_{11}$ is replaced with $w_{11}v_1^2$, $w_{12}$ with $w_{12}v_1v_2$, and
$w_{22}$ with $w_{22}v_2^2$. The conditions for fixation or for a stable or unstable
equilibrium are therefore determined by the relative magnitude of the
composite "fitness" of the heterozygous genotype relative to those of the
homozygous genotypes.
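
The claim that haploid and diploid selection combine into composite fitnesses can be verified numerically. The sketch below computes the next-generation frequency in two ways: step by step through the haploid and diploid phases, and in one step using the composite fitnesses. The function names and the parameter values are hypothetical and chosen only for illustration.

```python
def two_phase_step(p, v, w):
    """Frequency of A after haploid selection followed by diploid selection.

    v = (v1, v2): survivorships of A and a gametophytes.
    w = (w11, w12, w22): survivorships of AA, Aa, aa sporophytes.
    """
    q = 1.0 - p
    v1, v2 = v
    w11, w12, w22 = w
    v_bar = p * v1 + q * v2
    p_star, q_star = p * v1 / v_bar, q * v2 / v_bar   # after haploid mortality
    w_bar = p_star**2 * w11 + 2 * p_star * q_star * w12 + q_star**2 * w22
    return (p_star**2 * w11 + p_star * q_star * w12) / w_bar

def composite_step(p, v, w):
    """Same recursion written with composite fitnesses w11*v1^2, w12*v1*v2, w22*v2^2."""
    q = 1.0 - p
    v1, v2 = v
    w11, w12, w22 = w
    c11, c12, c22 = w11 * v1**2, w12 * v1 * v2, w22 * v2**2
    return (p**2 * c11 + p * q * c12) / (p**2 * c11 + 2 * p * q * c12 + q**2 * c22)

# Illustrative parameter values only.
p, v, w = 0.3, (1.0, 0.8), (0.9, 1.0, 0.7)
print(two_phase_step(p, v, w), composite_step(p, v, w))
```

The two functions should agree to within rounding error for any admissible parameter values.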

Meiotic Drive 

A situation analogous to, but distinct from, gametic selection takes place 
when there is non-Mendelian segregation in the heterozygous genotype. In 
females, unequal recovery of reciprocal products of meiosis can be caused by 
nonrandom segregation of homologous chromosomes to the functional egg 
nucleus, which is why non-Mendelian segregation is known generically as
meiotic drive. In other cases, the unequal recovery is caused by a gene or
genes that act to render gametes carrying the homologous chromosome
nonfunctional. Examples include "sperm killers" such as segregation distortion in
Drosophila melanogaster (Charlesworth and Hartl 1978) and the t alleles in the
house mouse (Hammer and Silver 1993) as well as "spore killers" described 
in filamentous fungi (Raju 1994). 

Because meiotic drive acts only in the heterozygous genotype, its effect is
to alter the term $pqw_{12}$ in Equation 6.10 for p'. This term comes from the
expression $\frac{1}{2} \times 2pqw_{12}$ for the proportion of A-bearing gametes
from surviving Aa genotypes, and the 1/2 is the Mendelian segregation ratio. If
the ratio of A : a gametes from Aa heterozygotes is k : 1 - k instead of
1/2 : 1/2, then the expression for p' becomes

$$p' = \frac{p^2 w_{11} + 2kpq\, w_{12}}{\bar{w}} \qquad (6.23)$$

where $\bar{w}$ is the average survivorship in the population defined in Equation
6.8. Since A is the driven allele, k > 1/2. Equation 6.23 is illustrative of meiotic
drive even though it requires that the non-Mendelian segregation affect both
sexes equally, a case that is not generally found in practice. One implication
of the equation is that, unless selection counteracts the meiotic drive, the
driven allele goes to fixation. In particular, if the relative viabilities are equal,
then $p' = p^2 + 2kpq$ and $\Delta p = pq(2k - 1)$, so that $p \to 1$ because k > 1/2.

In some examples of meiotic drive, including segregation distortion and the
t alleles, the driven allele is lethal when homozygous (Hartl 1970). Assuming
that the lethality is completely recessive, the survivorships are $w_{11} = 0$,
$w_{12} = 1$, and $w_{22} = 1$. Equation 6.23 implies that $p' = 2kp/(1 + p)$ and so
$\Delta p = p[(2k - 1) - p]/(1 + p)$. There is an interior equilibrium at
$\hat{p} = 2k - 1$, which intuition suggests (correctly) is locally stable. It is also
globally stable (Figure 6.10). Note that $\hat{p}$ is between 0 and 1 for any value
of k between 1/2 and 1. The calculations for a recessive-lethal driven allele are
a special case of the slightly more general model discussed in Problem 6.11.
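
Equation 6.23 is simple to iterate. The sketch below applies it with equal viabilities, where the driven allele should approach fixation, and with a recessive-lethal driven allele, where the frequency should settle at 2k - 1. The function names `drive_step` and `iterate`, the starting frequency, and the value k = 0.7 are hypothetical choices for illustration.

```python
def drive_step(p, k, w11=1.0, w12=1.0, w22=1.0):
    """One generation of Equation 6.23: viability selection plus meiotic drive.

    k is the proportion of A-bearing gametes produced by Aa heterozygotes.
    """
    q = 1.0 - p
    w_bar = p * p * w11 + 2 * p * q * w12 + q * q * w22
    return (p * p * w11 + 2 * k * p * q * w12) / w_bar

def iterate(p, k, generations, **fitnesses):
    """Apply drive_step repeatedly and return the final frequency of A."""
    for _ in range(generations):
        p = drive_step(p, k, **fitnesses)
    return p

k = 0.7  # illustrative strength of drive
print(iterate(0.05, k, 500))            # equal viabilities: A approaches fixation
print(iterate(0.05, k, 500, w11=0.0))   # recessive lethal: approaches 2k - 1 = 0.4
```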

PROBLEM 6.10 Suppose that the AA genotype is not completely
lethal but that its survivorship is given by 1 - s relative to a value of 1
for both Aa and aa genotypes. Show that
$\Delta p = pq[(2k - 1) - ps]/(1 - p^2s)$. Find $\hat{p}$ and define the conditions,
in terms of k and s, for which $\hat{p}$ is between 0 and 1. Show also that the
equilibrium is locally stable.

ANSWER Equation 6.23 implies that $p' = [p^2(1 - s) + 2kpq]/(1 - p^2s)$.
$\Delta p = p' - p$ simplifies to the formula given. Setting $\Delta p = 0$ yields
equilibria at 0, 1, and $\hat{p} = (2k - 1)/s$. For $\hat{p} > 0$, we need
(2k - 1)/s > 0, or k > 1/2. For $\hat{p} < 1$, we need (2k - 1)/s < 1, or
k < (s + 1)/2. Note that, as the selection against the A allele becomes
smaller (s closer to 0), more values of k result in fixation of the unfavorable
A allele and fewer result in an interior equilibrium. The stability of $\hat{p}$ can
be deduced by evaluating the derivative in Equation 6.19. For this purpose, it is
convenient to write $\Delta p$ as $pqs(\hat{p} - p)/(1 - p^2s)$. In taking the
derivative, remember that any term containing $\hat{p} - p$ becomes 0 when
$p = \hat{p}$, so these terms can be neglected. The derivative, evaluated at
$\hat{p}$, equals $-\hat{p}\hat{q}s/(1 - \hat{p}^2s)$, where $\hat{q} = 1 - \hat{p}$. The sign
of this number must be negative, and so the equilibrium at $\hat{p}$, when it
exists, is locally stable.

[Figure 6.10. Panel titles: (A) Viability selection only; (B) Meiotic drive only; (C) Viability and meiotic drive. Axis: $\Delta p$.]

Figure 6.10 The balance between meiotic drive and viability selection. (A)
Viability selection only: by itself, viability selection would eliminate the a
allele. (B) Meiotic drive only: the heterozygous genotype Aa produces 40%
A-bearing and 60% a-bearing gametes; with meiotic drive alone, the A allele
would be eliminated. (C) $\Delta p$ versus p when both viability selection and
meiotic drive are operating at the same time, using the same fitness and meiotic
drive parameters as above. In this example, when both processes operate
simultaneously, their offsetting effects create a stable polymorphism.
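
The conditions derived in the answer can be checked numerically. The sketch below evaluates the formula for the change in frequency and tests whether the interior equilibrium (2k - 1)/s lies strictly between 0 and 1. The function names `delta_p` and `interior_equilibrium` and the parameter values are hypothetical.

```python
def delta_p(p, k, s):
    """Change in frequency of the driven allele A in one generation when
    AA has survivorship 1 - s and Aa heterozygotes transmit A with probability k."""
    q = 1.0 - p
    return p * q * ((2 * k - 1) - p * s) / (1.0 - p * p * s)

def interior_equilibrium(k, s):
    """Return (2k - 1)/s if it lies strictly between 0 and 1, else None."""
    p_hat = (2 * k - 1) / s
    return p_hat if 0.0 < p_hat < 1.0 else None

# Illustrative values: drive k = 0.6 against selection s = 0.5.
k, s = 0.6, 0.5
p_hat = interior_equilibrium(k, s)
print(p_hat, delta_p(0.3, k, s), delta_p(0.5, k, s))
```

A positive change below the equilibrium and a negative change above it is what local stability looks like in this model; when k is at least (s + 1)/2 the function returns None, matching the case in which the driven allele goes to fixation.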

Multiple Alleles 

The presence of multiple alleles complicates the analysis of selection
because the number of fitness parameters increases. With n alleles, there are
n(n + 1)/2 possible genotypes, each with its own fitness. Furthermore,
simple generalizations from two-allele theory do not necessarily carry over to
multiple alleles. Consider the example of heterozygote superiority. Intu- 
itively, one might expect that fitnesses yielding stable, multiple-allele poly- 
morphisms would be easy to generate by requiring that each heterozygous 
genotype have a greater fitness than the homozygous genotypes formed 
from the constituent alleles. This is not the case, however. If, for n alleles, 
the fitnesses of the genotypes are assigned at random between 0 and 1,
subject to the condition that, for each i and j, $w_{ij} > \max(w_{ii}, w_{jj})$, then only a
relatively small proportion of systems with four or more alleles yields a stable
polymorphism with all alleles present. For four, five, and six alleles, the 
percentage of fitness sets yielding a stable equilibrium is 12.6, 1.2, and 0.03, 
respectively (Lewontin et al. 1978). The reason for the low percentages is 
that, even if a heterozygote is more fit than its constituent homozygotes, 
there might be a different homozygote more fit than all three. All right, how 
about requiring that each heterozygote be better than every homozygote?
Surprisingly, this requirement does not help matters much. In this case, for
four, five, and six alleles, the percentage of fitness sets yielding a stable
equilibrium is 34.3, 10.4, and 1.3, respectively (Lewontin et al. 1978). The
point is that polymorphisms with greater than three or four alleles are 
extremely unlikely to be maintained by selection for simple heterozygous 
advantage with constant survivorship. If selection is implicated in such a 
case, models of selection such as diversifying selection or heterogeneous 
environments are much more plausible. On the other hand, the fitnesses of 
genotypes in nature are not chosen simultaneously by a random number 
generator. Each new allele that arises is tested against the resident alleles, 
and the new allele is able to invade the population if its marginal fitness 
exceeds the mean fitness of the population. By this process, multiple allele 
polymorphisms can be accumulated, and the order in which the mutations 
appear makes a difference (Spencer and Marks 1988).

The possibility of multiple alleles also creates surprising situations in 
which the outcome of natural selection depends on the order in which the 
alleles are introduced into the population. Earlier in this chapter we men- 
tioned the sickle-cell hemoglobin polymorphism in Africa and its relation to 
malaria resistance. People who are homozygous A A for the normal allele are 
susceptible to falciparum malaria, those who are heterozygous AS for the 
sickle-cell allele are resistant to malaria and have a mild anemia, and those 
who are homozygous SS for the sickle-cell allele have a life-threatening
anemia. This is a classic case of heterozygote superiority. There is another allele,
C, found at low frequency in populations in which the S allele is prevalent. 
The C allele is also protective against malaria, but the allele is recessive, and 
so only the CC genotypes are resistant. Unlike the S allele, the C allele does 
not cause anemia. 

The relative survivorship of each of the various hemoglobin genotypes
has been estimated based on studies of more than 32,000 people in 72
populations in West Africa (Cavalli-Sforza and Bodmer 1971). The survivorships
are given in the following table, which indicates the genotypes that are
resistant and those that have severe hemolytic anemia. The survivorships were
estimated in a geographical region where malaria was common. Note that
the S allele causes a severe anemia in the heterozygous SC genotype, but not
so serious as that in the homozygous SS genotype.

Genotype         AA     AS          SS       AC     SC       CC
Health status    -      Resistant   Anemia   -      Anemia   Resistant
Survivorship     0.9    1.0         0.2      0.9    0.7      1.3

Inspection of these survivorships reveals a paradox. The CC genotype has
the highest fitness, yet the C allele is not fixed. The reason is found in the
historical order in which the S and C mutations took place. The A allele is
the ancestral type and undoubtedly predated the human settlement of
regions subject to malaria. In such a region, the appearance of an S allele
creates a heterozygous advantage, and natural selection quickly attains a stable
equilibrium at which the ratio of A : S alleles is approximately 8 : 1. At this
equilibrium, the average fitness in the population is $\bar{w} = 0.911$. Now suppose
that mutation or migration were to introduce a small number of C alleles.
Because C alleles are rare, each is present in either the AC genotype, with
probability 8/9, or in the SC genotype, with probability 1/9. The average fitness
of genotypes heterozygous for C is therefore 0.878, which is smaller than the
average fitness in the population. Hence, the frequency of C decreases, and C
goes extinct. The C allele has no chance of invading an A/S polymorphism
unless the initial frequency of C is sufficiently large. Figure 6.11 illustrates
this phenomenon. With the survivorships given in this example, the critical
initial frequency of C that allows invasion is 0.073. Once C can get established
in the population, it eventually becomes fixed.
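
The invasion threshold can be recovered with a short calculation. The sketch below assumes survivorships of 0.9, 1.0, 0.2, 0.9, 0.7, and 1.3 for AA, AS, SS, AC, SC, and CC; these values are an assumption, chosen because they reproduce the 8 : 1 ratio, the mean fitness of 0.911, the heterozygote average of 0.878, and the threshold of 0.073 quoted in the text. It also assumes, as a simplification, that A and S stay in their 8 : 1 proportions while the frequency of C varies. All function names are hypothetical.

```python
# Assumed survivorships (see the lead-in); keys are unordered allele pairs.
W = {("A", "A"): 0.9, ("A", "S"): 1.0, ("S", "S"): 0.2,
     ("A", "C"): 0.9, ("S", "C"): 0.7, ("C", "C"): 1.3}

def w(a, b):
    """Survivorship of the genotype formed by alleles a and b."""
    return W[(a, b)] if (a, b) in W else W[(b, a)]

def freqs_with_c(x):
    """Allele frequencies with C at frequency x and A : S held at 8 : 1."""
    return {"A": (1.0 - x) * 8.0 / 9.0, "S": (1.0 - x) / 9.0, "C": x}

def marginal_and_mean(freqs):
    """Marginal fitness of each allele and mean fitness under random mating."""
    marginal = {a: sum(freqs[b] * w(a, b) for b in freqs) for a in freqs}
    mean = sum(freqs[a] * marginal[a] for a in freqs)
    return marginal, mean

def delta_c(x):
    """Change in the frequency of C in one generation."""
    marginal, mean = marginal_and_mean(freqs_with_c(x))
    return x * (marginal["C"] - mean) / mean

def critical_frequency(lo=1e-6, hi=0.5, tol=1e-10):
    """Bisection for the frequency of C at which delta_c changes sign."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if delta_c(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

print(round(critical_frequency(), 3))
```

Below the threshold the marginal fitness of C is less than the population mean, so C declines; above it, the CC homozygotes are common enough to carry C upward.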

[Figure 6.11. Horizontal axis: Frequency of C allele. Vertical axis: change in frequency of C.]

Figure 6.11 Change in frequency of the hemoglobin C allele in a population
in which the A and S alleles are present in their equilibrium proportions of 8 : 1.
When the initial frequency of C is small, the change in frequency is negative,
and so C is eliminated even though CC genotypes have the highest fitness. The
C allele is unable to invade unless its initial frequency is greater than 0.073, and
in that case C goes to fixation. The plot is based on the survivorship values
given in the text.

Multiple Loci and Gene Interaction: Epistasis 

With multiple loci, as many types of gametes are possible as there are
combinations of alleles. The simplest example is the two-locus, two-allele case, in
which the possible gametes are AB, Ab, aB, and ab. In the absence of
recombination (r = 0), each type of gamete can be regarded as an "allele" of one
locus with four alleles. The principles of multiple-allele selection then apply, and
some of the "alleles" may be eliminated by selection. The presence of
recombination complicates matters because each gametic type is continually
recreated by recombination even if it is disfavored by selection. The influence of
recombination on the outcome of selection is determined by the recombination
fraction and by the degree of interaction between the loci. When selection
acts on the phenotype produced by the joint effects of multiple loci, there
are two general situations:

• Changes in allele frequency are driven primarily by the selection coeffi- 
cients and recombination plays a minor role. 

• Selection and recombination are about equally important in determining 
the outcome. 

The former is usually the case with weak epistasis and moderate or loose
linkage; the latter is more prevalent with strong epistasis and tight linkage.
The term epistasis is often used in population genetics as a synonym for
gene interaction; it applies to any situation in which the genetic effects of
different loci that contribute to a phenotypic trait are not additive. In the
two-locus, two-allele example, the fitnesses (survivorships) of the genotypes can
be written as shown in Table 6.3, where it is assumed that the two types of
double heterozygote (AB/ab and Ab/aB) have the same fitness; for
convenience, this value is often set at $w_{22} = 1$. For each single-locus genotype,
the average survivorship is equal to the weighted average across each genotype
at the other locus. In Table 6.3, these averages are denoted $w_{AA}$, $w_{Aa}$, and so
on. Additivity across loci means that $w_{11} = w_{AA} + w_{BB}$,
$w_{12} = w_{AA} + w_{Bb}$, and so forth for all genotypes, including
$w_{22} = w_{Aa} + w_{Bb} = 1$. If additivity does not apply across all nine genotypes,
then epistasis is said to be present. A discussion of epistasis from a statistical
point of view appears in Chapter 9.

Table 6.3

                          Genotype at B locus
Genotype at A locus    BB          Bb              bb          Average
AA                     $w_{11}$    $w_{12}$        $w_{13}$    $w_{AA}$
Aa                     $w_{21}$    $w_{22} = 1$    $w_{23}$    $w_{Aa}$
aa                     $w_{31}$    $w_{32}$        $w_{33}$    $w_{aa}$
Average                $w_{BB}$    $w_{Bb}$        $w_{bb}$

Note: The table assumes that the two types of double heterozygotes, AB/ab and
Ab/aB, have the same fitness, $w_{22}$.
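
One way to test a fitness table for epistasis is to ask whether every entry can be written as a row effect plus a column effect. The sketch below uses that row-plus-column criterion, which is an assumption standing in for the weighted averages of Table 6.3: a table is additive exactly when every two-by-two interaction contrast against the first row and column is zero. The function name `is_additive` and both example tables are hypothetical.

```python
def is_additive(w, tol=1e-12):
    """Return True if the 3x3 fitness table w[i][j] (rows AA, Aa, aa; columns
    BB, Bb, bb) can be written as a_i + b_j, that is, if there is no epistasis
    on the additive scale."""
    for i in range(1, 3):
        for j in range(1, 3):
            # Interaction contrast relative to the first row and first column.
            contrast = w[i][j] - w[i][0] - w[0][j] + w[0][0]
            if abs(contrast) > tol:
                return False
    return True

# Illustrative additive table: each entry is a row effect plus a column effect.
additive = [[a + b for b in (0.0, 0.1, 0.2)] for a in (0.5, 0.6, 0.7)]

# Illustrative epistatic table: the double heterozygote is fitter than the
# single-locus effects alone would predict.
epistatic = [[0.9, 0.6, 0.9],
             [0.8, 1.0, 0.8],
             [0.9, 0.6, 0.9]]

print(is_additive(additive), is_additive(epistatic))
```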

When there is strong epistasis and tight linkage, complications abound.
With two loci and two alleles at each, there are as many as 15 equilibria. Most
of them are unstable, but examples are known in which four interior
equilibria are simultaneously stable. Figure 6.12 is one example that shows the
average fitness in the population $\bar{w}$ as a function of the allele frequencies of
A and B. At any point in time, the gametic frequencies in the population are
determined not only by the allele frequencies of A and B but also by the linkage
disequilibrium parameter, which was denoted by the symbol D in the section
on linkage disequilibrium in Chapter 3. In this example, all 15 equilibria are
realized. There are four corner equilibria in which one gametic type is fixed,
namely, AB, Ab, aB, or ab; there are also four edge equilibria in which one
allele of either locus is fixed, namely, A, a, B, or b. With the survivorships as in
Figure 6.12, all of the corner and edge equilibria are unstable. There are also
three unstable interior equilibria, each of which has $p_A = p_B = 1/2$ and so is
located at the position of the open circle on the saddle in Figure 6.12; these
equilibria have the same allele frequencies but differ in the degree of linkage
disequilibrium. The positions of the stable equilibria are indicated by the
solid circles, each of which represents two equilibrium points with the same
allele frequencies but differing in their value of D. In this case, the equilibria
are symmetrical.

[Figure 6.12. Plot annotation: two stable equilibria with D = -0.074412 and D = +0.074412. Axis: Frequency of A allele.]

Figure 6.12 An example of two-locus, two-allele survivorship selection in which
there are four stable interior equilibria, the positions of which are indicated by the
dots near two of the corners. Each dot represents two stable points differing in the
sign of the linkage disequilibrium. This example also includes three unstable
interior equilibria (represented by the open circled point in the center), four
unstable edge equilibria, and four unstable corner equilibria. The survivorship
parameters, in the notation of Table 6.3, are
$w_{11} = w_{13} = w_{31} = w_{33} = 0.9$, $w_{21} = w_{23} = 0.8$, and
$w_{12} = w_{32} = 0.6$; the recombination fraction is r = 0.09. (Example from
Hastings 1985.)

Figure 6.12 is a plot of $\bar{w}$ to emphasize that the average fitness in the
population is not necessarily a maximum at equilibrium. In this example, none of
the four stable equilibria is a point of maximum average fitness. The maximum
fitness is found at any of the four corners, and these equilibria are
unstable. Furthermore, in the vicinity of each stable equilibrium, as the
population moves toward the equilibrium from certain directions, the average
fitness must decrease as the equilibrium is approached. Hence, not only is
average fitness not necessarily a maximum at equilibrium, natural selection
can cause a decrease in average fitness.
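
The behavior of average fitness under two-locus selection can be explored by iterating the standard recursion for gametic frequencies with viability selection and recombination. The following is a minimal sketch, not the computation behind Figure 6.12: the fitness table `W`, the starting frequencies, and the recombination fraction are illustrative assumptions, and all function names are hypothetical.

```python
# Gametes AB, Ab, aB, ab, coded as (copies of a, copies of b).
GAMETES = [(0, 0), (0, 1), (1, 0), (1, 1)]

def pair_fitness(W, g, h):
    """Survivorship of the genotype formed by gametes g and h.
    W rows are AA, Aa, aa; W columns are BB, Bb, bb."""
    return W[g[0] + h[0]][g[1] + h[1]]

def mean_fitness(x, W):
    """Average survivorship with random union of gametes at frequencies x."""
    return sum(x[i] * x[j] * pair_fitness(W, GAMETES[i], GAMETES[j])
               for i in range(4) for j in range(4))

def linkage_disequilibrium(x):
    """D = x_AB * x_ab - x_Ab * x_aB."""
    return x[0] * x[3] - x[1] * x[2]

def two_locus_step(x, W, r):
    """One generation of viability selection followed by recombination."""
    marginal = [sum(x[j] * pair_fitness(W, GAMETES[i], GAMETES[j]) for j in range(4))
                for i in range(4)]
    w_bar = sum(x[i] * marginal[i] for i in range(4))
    # Recombination in surviving double heterozygotes moves frequency from
    # AB and ab to Ab and aB in proportion to r * w_22 * D.
    shift = r * W[1][1] * linkage_disequilibrium(x)
    signs = (-1.0, 1.0, 1.0, -1.0)
    return [(x[i] * marginal[i] + signs[i] * shift) / w_bar for i in range(4)]

# Illustrative parameters only.
W = [[0.9, 0.6, 0.9],
     [0.8, 1.0, 0.8],
     [0.9, 0.6, 0.9]]
x = [0.3, 0.2, 0.2, 0.3]
for generation in range(200):
    x = two_locus_step(x, W, r=0.09)
print([round(f, 4) for f in x], round(mean_fitness(x, W), 4),
      round(linkage_disequilibrium(x), 4))
```

Recording `mean_fitness` at each generation, rather than only at the end, shows whether average fitness rises or falls along a particular trajectory.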

In models in which fitness depends on multiple interacting loci, do we 
really have to give up the attractive generalization from one-locus theory that 
selection acts in such a way as to increase average fitness? Not altogether. 
Although even the two-locus, two-allele model of survivorship selection is 
beyond present techniques of mathematical analysis, an important general- 
ization has come from approximate solutions as well as from computer sim- 
ulations (Ewens 1979): if epistasis is not too strong, and linkage is not too 
tight, then the average fitness in the population usually increases. 

This statement is multiply qualified ("not too strong . . . not too tight . . . 
usually increases") because exceptions can rather easily be constructed. 
However, the generalization is observed in most generations in most
numerical examples when the survivorships are chosen at random (Karlin and
Carmelli 1975). To the extent that it is true, the generalization supports the
powerful metaphor that natural selection tends to increase average fitness. If 
one can imagine a complex surface of hills and valleys corresponding to 
regions of high and low average fitness, then one can speak metaphorically of 
a population as a sort of "hill climber" moving across this surface and scaling 
a fitness "peak." This picturesque analogy is a central concept in Wright's 
shifting balance theory of evolution, which is discussed in the last section of 
this chapter. However, there are enough exceptions to the hill-climbing
generalization that maximization of average fitness cannot be used as a guide to
predicting the outcome of any particular set of fitness values. Each model
must be considered in detail on its own.

Sexual Selection 

It seems that, wherever you look in nature, animals have physical adornments 
or behavioral displays to help them in obtaining mates. In some cases, there 
is direct competition between animals, usually males, as exemplified by the 
contests of antler bashing in moose or head butting in bighorn sheep. In other
cases, there is indirect competition, as seen in the behavioral displays of male 
peacocks in full plumage strutting their stuff. These are dangerous activities. 
A bighorn sheep can get his skull fractured or fall off a cliff. The male peacock 
is conspicuous, burdened, and preoccupied — vulnerable to any predator. 

Darwin (1871) was the first to draw attention to competition for mates as a
source of selection not necessarily related to adaptation of the organism to its
environment. This type of selection he called sexual selection. In the case of
direct competition for mates, it is easy to understand that a successful male
leaves more progeny than an unsuccessful male, and so alleles promoting the
physical adornments, strength, and aggressiveness needed for successful
competition for mates are perpetuated even though they may occasionally be
detrimental. The example of indirect competition is considerably more subtle
because the male is merely advertising. The female does the choosing. One
theory for the evolution of male sexual displays is that, in the early stages of
their evolution, the displays take advantage of a female preference. The origin
of the initial preference is unclear. Darwin suggested that female choosiness
and offspring number are both associated with superior nutrition, hence
choosy females may, at the beginning, have had more offspring. Whatever the
cause, given an initial choosiness among females, males with more effective
displays are chosen preferentially as mates, and their offspring receive alleles
that create both the displays in the males and the preferences in the females. If
these traits are genetically correlated (as, for example, through common
hormonal or neurological pathways or through linkage disequilibrium), then
selection becomes a self-accelerating process promoting increasingly elaborate
displays and increasingly greater choosiness. According to Fisher (1930):

The two characteristics affected by such a process, namely plumage develop- 
ment in the male, and sexual preference for such developments in the female, 
must thus advance together, and so long as the process is unchecked by severe 
counterselection, will advance with ever-increasing speed. In the total absence 
of such checks, it is easy to see that the speed of development will be propor- 
tional to the development already attained. There is thus, in any situation in 
which sexual selection is capable of conferring a great reproductive advan- 
tage, the potentiality of a runaway process which will, however small the 
beginnings from which it arose, must, unless checked, produce great effects, 
and in the later stages with great rapidity. 

The ever-accelerating process is called runaway sexual selection, and the 
conditions under which it takes place have been studied theoretically (Lande 
and Arnold 1985; Kirkpatrick and Barton 1995; Iwasa and Pomiankowski 1995). 

Kin Selection


One alternative type of selection, called kin selection, makes use of an 
extended concept of "fitness." In kin selection, a positive selection for cer- 
tain alleles takes place indirectly through enhanced reproduction of the 
genetic relatives of carriers of the alleles rather than directly through an 
increased fitness of the carriers themselves. Kin selection has been postulated 
in attempts to account for the evolution of altruism. A behavior is regarded 
as altruism if it increases the fitness of other organisms at the expense of 
one's own fitness. Altruistic behavior is exhibited most dramatically by 
social insects such as termites, ants, and bees, in which certain worker castes 
exert their labors for the care, protection, and reproduction of the queen and 
her offspring but do not reproduce themselves. Other, less dramatic exam- 
ples of altruistic behavior include phenomena such as the care of offspring 
by their parents. 

A central consideration in kin selection is that relatives have genes in com- 
mon. Therefore, a gene that causes altruistic behavior can increase in fre- 
quency if the increase in the recipient's fitness as a result of altruism is 
sufficiently large to offset the decrease in the altruist's own fitness. The essentials 
of the situation can be made clear by considering the case of identical 
twins. Because identical twins are genetically identical, the reproduction of 
one's twin is genetically equivalent to reproduction by oneself. Thus, it 
makes no difference if an altruistic organism decreases its own fitness for the 
sake of an equal increase in fitness of an identical twin; from an evolutionary 
point of view, it is an even trade because the combined number of offspring 
from both twins remains unchanged. By the same token, if an altruistic act 
decreases the fitness of an organism by an amount less than the increase 
gained by an identical twin, then the altruism results in a net increase in the 
combined number of offspring. One would, therefore, expect altruism 
between identical twins to be favored by natural selection as long as the risk 
to the altruist is no greater than the benefit to the recipient. 

These considerations of identical twins can be extended to other degrees 
of relationship as well, but the risk to the altruist must be correspondingly 
smaller than the benefit to the recipient because other types of relatives share 
fewer genes than identical twins. The break-even points for altruism toward 
various degrees of relationship have been trenchantly summarized by J. B. S. 
Haldane, who is said to have quipped that he would lay down his life for two 
brothers, four nephews, or eight cousins. In any case, fitness considerations 
that take into account not only an organism's own fitness but also the fitness 
of relatives (other than direct descendants) constitute what is called the 
inclusive fitness of the organism. 

To be concrete, suppose that altruism results in a decrease in fitness c of 
the altruist that is offset by an increase in fitness b in the recipient. The gene 
for altruism increases in frequency if the ratio of cost to benefit is small 
enough, relative to the genetic relationship between the altruist and the recipient; 
that is, the gene for altruism increases in frequency if 

$$\frac{c}{b} < r \tag{6.24}$$

as shown first by Hamilton (1964) and discussed in detail by Cavalli-Sforza 
and Feldman (1978) and Uyenoyama and Feldman (1980). In this context, r is 
a measure of genetic relationship between the altruist X and the recipient of 
the altruism Y, defined as 

$$r = \frac{2F_{XY}}{1 + F_X} \tag{6.25}$$


where $F_X$ is the inbreeding coefficient of the altruist X, and $F_{XY}$ is the inbreeding 
coefficient of a hypothetical offspring of X and Y. As illustrated in Figure 
6.13, r equals the probability that two gametes from X and Y contain alleles 
that are identical by descent, $F_{XY}$, relative to the probability that two gametes 
from X contain alleles that are identical by descent, $(1 + F_X)/2$. The cost-benefit 
tradeoff in Equation 6.24 is generally valid for weak selection when 
$F_X = 0$ and valid for additive alleles even when $F_X \neq 0$ (Aoki 1981). 


Figure 6.13 Definition of the genetic relationship between an altruist X and 
the recipient of the altruism Y. (A) Two alleles chosen at random from an organism 
X are identical by descent with probability $(1 + F_X)/2$ (see Figure 4.13). (B) 
Two alleles chosen at random, one from X and the other from Y, are identical by 
descent with probability $F_{XY}$, which is the inbreeding coefficient of a hypothetical 
offspring of X and Y. The ratio of $F_{XY}$ to $(1 + F_X)/2$ is the appropriate measure 
of genetic relationship in the consideration of kin selection. 

PROBLEM 6.11 For the illustrated pedigrees (A) and (B) of full siblings 
shown in the accompanying figure, calculate the break-even value 
of the benefit b to the recipient of altruism Y, relative to a cost value c = 1 
to the altruist X, in order to ensure an increase in frequency of an additive 
gene for altruism. Why are the answers different in the two cases? 


ANSWER In case (A), a hypothetical offspring of X and Y has an 
inbreeding coefficient of $F_{XY} = (1/2)^3 + (1/2)^3 = 1/4$, and $F_X = 0$. Therefore, 
$r = 2 \times 1/4 = 1/2$ and the break-even value of $c/b = 1/2$. Hence, for c = 1, 
the break-even value of b = 2. (This calculation is the theoretical basis 
of Haldane's quip about laying down his life for two brothers.) In 
pedigree (B), $F_{XY} = 4 \times (1/2)^5 + 2 \times (1/2)^3 = 3/8$, and $F_X = 2 \times (1/2)^3 = 1/4$. 
Therefore, $r = 2(3/8)/(1 + 1/4) = 3/5$. For a cost of c = 1, the break-even 
value of b equals 5/3. The values differ in the two cases because of the 
differing inbreeding. In case (B), even though X is inbred, the break-even 
value of b is smaller because of the closer genetic relationship 
between X and Y. 
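
The arithmetic of the cost-benefit condition can be checked with exact fractions. The following is a minimal sketch, not part of the original text: the function names are illustrative, and the inbreeding coefficients for pedigrees (A) and (B) are taken as given in the answer above rather than derived from the pedigrees.

```python
from fractions import Fraction


def relatedness(f_xy, f_x=Fraction(0)):
    """Genetic relationship r = 2*F_XY / (1 + F_X) between altruist X and recipient Y."""
    return 2 * Fraction(f_xy) / (1 + Fraction(f_x))


def break_even_benefit(cost, r):
    """Benefit b at which c/b equals r; altruism is favored only for larger b."""
    return Fraction(cost) / r


def altruism_favored(cost, benefit, r):
    """Condition c/b < r for an additive altruism allele to increase in frequency."""
    return Fraction(cost) / Fraction(benefit) < r


half = Fraction(1, 2)

# Pedigree (A): noninbred full siblings, F_XY = (1/2)^3 + (1/2)^3 and F_X = 0.
r_a = relatedness(half**3 + half**3)
# Pedigree (B): inbred full siblings, coefficients as stated in the answer.
r_b = relatedness(4 * half**5 + 2 * half**3, 2 * half**3)

b_a = break_even_benefit(1, r_a)
b_b = break_even_benefit(1, r_b)
print(r_a, b_a)  # 1/2 2
print(r_b, b_b)  # 3/5 5/3
```

With r = 1/2, a benefit of exactly 2 is only the break-even point; the allele is favored for any larger benefit.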


Another alternative type of selection is interdeme selection, 
which takes place between semi-isolated subpopulations (demes) of the 
same species. If subpopulations composed of certain genotypes are more 
likely to become extinct and have their vacated habitats recolonized by 
migrants from other subpopulations composed of other genotypes, then the 
more successful subpopulations can, in some sense, be considered as having 
a greater "fitness" than the less successful ones. Since this concept of population 
fitness is a characteristic of the entire population and not merely the 
average fitness of the genotypes within it ($\bar{w}$), interdeme selection is outside 
the realm of most conventional models of selection. Interdeme selection is 
one type of group selection (Wilson 1983). 

Interdeme selection plays an essential role in the shifting balance theory 
of evolution of Wright (1977 and earlier). In the shifting balance theory, a 
large population that is subdivided into a set of small, semi-isolated sub- 
populations (demes) has the best chance for the subpopulations to explore 
the full range of the adaptive topography and to find the highest fitness peak 
on a convoluted adaptive surface. If the subpopulations are sufficiently 
small, and the migration rate between them is sufficiently small, then the 
subpopulations are susceptible to random genetic drift of allele frequencies, 
which allows them to explore their adaptive topography more or less inde- 
pendently. In any subpopulation, random genetic drift can result in a tem- 
porary reduction in fitness that would be prevented by selection in a larger 
population, and so a subpopulation can pass through a "valley" of reduced 
fitness and possibly end up "climbing" a peak of fitness higher than the 
original. Any lucky subpopulation that reaches a higher adaptive peak on 
the fitness surface increases in size and sends out more migrants to nearby 
subpopulations, and the favorable gene combinations are gradually spread 
throughout the entire set of subpopulations by means of interdeme selection. 

The shifting balance process includes three distinct phases: 

1. An exploratory phase, in which random genetic drift plays an important 
role in allowing small subpopulations to explore their adaptive topography. 

2. A phase of mass selection, in which favorable gene combinations created 
by chance in the random drift phase become rapidly incorporated into 
the genome of local subpopulations by the action of natural selection. 

3. A phase of interdeme selection, in which the more successful demes 
increase in size and rate of migration; the excess migration shifts the 
allele frequencies of nearby subpopulations until they also come under 
the control of the higher fitness peak. The favorable genotypes thereby 
become spread throughout the entire population in an ever-widening 
distribution. Where the region of spread from two such centers overlaps, 
a new and still more favorable genotype may be formed and itself 
become a center for interdeme selection. In this manner, the whole of the 
adaptive topography can be explored, and there is a continual shifting of 
control from one adaptive peak to control by a superior one. 

The shifting balance theory has played an important role in evolutionary 
thinking, in part because of its use of mountain-climbing terms as tropes for 
stages in the evolutionary progress: "exploration" of the adaptive topogra- 
phy, chance "discovery" of a route to a higher adaptive peak, and ultimately 
the "conquest" of the highest adaptive peak by the whole species. However, 
as a comprehensive theory of evolution, many aspects of the theory remain 
untested. For the theory to work as envisaged, the interactions between alle- 
les must often result in complex adaptive topographies with many peaks and 
valleys. The population must be split up into smaller subpopulations, which 
must be small enough for random genetic drift to be important, but large 
enough for mass selection to fix favorable combinations of alleles. Although 
migration between demes is essential, neighboring demes must be sufficient- 
ly isolated for genetic differentiation to take place, but sufficiently connected 
for favorable gene combinations to spread. Because of uncertainty about the 
applicability of these assumptions, the shifting balance process remains a pic- 
turesque metaphor that is still largely untested. However, computer simula- 
tions have been carried out to investigate the range of magnitudes of the key 
parameters that are necessary for the shifting balance process to be effective; 
these parameters include the size of the subpopulations, the rate of migration 
and range of dispersal of the migrants, the degree of epistasis between genes, 
and the rate of recombination (Bergman et al. 1995). Some empirical studies 
have also explored the partitioning of genetic variance within and between 
groups for traits associated with fitness (Wade and Goodnight 1991). 

One important implication of interdeme selection is that alleles that are 
harmful in themselves may nevertheless be favored because they are beneficial 
to the group. This principle is illustrated in the model in Table 6.4, where 
the allele A' is harmful to organisms within demes but favorable to the deme 
as a whole. Equation 6.11 implies that, within the ith deme, $\Delta q_i = -cq_i(1 - q_i)$ 
(assuming that $\bar{w} = 1$). Averaging across all of the subpopulations, the change 
in allele frequency resulting from selection within subpopulations, $\Delta q_w$, 
equals $-c\bar{q}(1 - \bar{q})(1 - F)$, where F is the fixation index $F_{ST}$ discussed in Chapter 4. 

Table 6.4 [Model of interdeme selection. Rows: frequency in deme i; within-population fitness; between-population fitness of deme i. Entries that survive extraction: within-population fitnesses $1 - c$ and $1 - 2c$; between-population fitness $1 + 2(b - c)q_i$.] 

At the same time that within-subpopulation selection takes place, interdeme 
selection favors demes containing A', and the change in allele frequency 
resulting from between-subpopulation selection, $\Delta q_b$, equals 
$2(b - c)\bar{q}(1 - \bar{q})F$, as shown by Crow and Aoki (1982). Putting the within-subpopulation 
and between-subpopulation selection together, the total 
change in the frequency of A' is 

$$\Delta q = \Delta q_w + \Delta q_b = -c\bar{q}(1 - \bar{q})(1 - F) + 2(b - c)\bar{q}(1 - \bar{q})F \tag{6.26}$$

The terms on the right-hand side can be interpreted by considering the 
extremes of F = 0 and F = 1. When F = 0, there is no population substructure, 
which means that all subpopulations have the same allele frequency $\bar{q}$; in this 
case, the change in allele frequency is just $-c\bar{q}(1 - \bar{q})$. At the other extreme, 
when F = 1, each subpopulation is fixed for either A or A', and the proportion 
fixed for A' equals $\bar{q}$. The between-subpopulation selection is therefore 
analogous to selection between alleles in a haploid organism in which the 
fitnesses of A and A' demes are in the ratio 1 : 1 + 2(b - c). In this case, therefore, 
the change in allele frequency is $2(b - c)\bar{q}(1 - \bar{q})$ (from Equation 6.6, assuming 
that $\bar{w} = 1$). 

Equation 6.26 implies that $\Delta q > 0$ if 

$$\frac{b - c}{c} > \frac{1 - F}{2F} \tag{6.27}$$


This is the condition necessary for selection between demes to override 
selection within demes, and the formulation is quite general (Crow and Aoki 
1982). A biological interpretation of the inequality in Equation 6.27 can be 
inferred by comparison with the break-even point for kin selection given in 
Equations 6.24 and 6.25. Expressing 6.27 in terms of r = 2F/(1 + F), which 
means that F = r/(2 - r), yields c/b < r; this condition is identical to Equation 
6.24. In these models, the equivalence between kin selection and interdeme 
selection results from the shared remote ancestry of the members of each 
subpopulation caused by random genetic drift among the subpopulations. 
The members of each subpopulation are related by kinship, and so interdeme 
selection is the same phenomenon as kin selection; the break-even point is 
that at which the benefit b to one's kin through interdeme selection equals the 
cost to one's self c through direct selection against the A' allele. 
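
The equivalence between the interdeme condition and the kin-selection condition can be checked numerically. This is a minimal sketch under the model's own assumptions (average fitness taken as 1); the function names and the example values of b, c, and F are illustrative and do not come from the text.

```python
def delta_q(q_bar, b, c, F):
    """Equation 6.26: within-deme plus between-deme change in the frequency of A'."""
    within = -c * q_bar * (1 - q_bar) * (1 - F)
    between = 2 * (b - c) * q_bar * (1 - q_bar) * F
    return within + between


def interdeme_condition(b, c, F):
    """Equation 6.27: (b - c)/c > (1 - F)/(2F)."""
    return (b - c) / c > (1 - F) / (2 * F)


def kin_condition(b, c, F):
    """Equation 6.24 with r = 2F/(1 + F): c/b < r."""
    r = 2 * F / (1 + F)
    return c / b < r


# Hypothetical example values: cost 1, two candidate benefits, F = 0.2.
print(interdeme_condition(4, 1, 0.2), kin_condition(4, 1, 0.2))      # True True
print(interdeme_condition(2.5, 1, 0.2), kin_condition(2.5, 1, 0.2))  # False False
print(delta_q(0.5, 4, 1, 0.2) > 0, delta_q(0.5, 2.5, 1, 0.2) > 0)    # True False
```

In both examples the sign of the total change in allele frequency agrees with the two inequalities, as the algebra requires.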

If there are a large number of subpopulations, each of size N, that 
exchange migrants in such a way that m is the proportion of genes in each 
deme that are exchanged each generation for genes chosen at random from 
the other demes, then the approximate value of F at equilibrium is given by 
Equation 5.17 as F = 1/(1 + 4Nm). Consequently, the right-hand side of Equation 
6.27 becomes 2Nm. In other words, (1 - F)/2F equals the number of 
migrant diploid organisms per generation. We therefore conclude from Equation 
6.27 that selection between demes overrides selection within demes only 
when the benefit to the group (b - c), relative to the cost to the individual 
organism (c), is greater than the average number of migrant organisms per 
generation. This principle defines a rather stringent limit above which migration 
among demes cancels any possible effects of interdeme selection. 
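
The substitution of the equilibrium F into the right-hand side of the interdeme condition can be verified with exact arithmetic. The sketch below assumes the equilibrium formula F = 1/(1 + 4Nm) quoted above; the function names and the values N = 100 and m = 1/100 are hypothetical, chosen only for illustration.

```python
from fractions import Fraction


def equilibrium_F(N, m):
    """Approximate equilibrium fixation index, F = 1/(1 + 4Nm)."""
    return 1 / (1 + 4 * N * m)


def migration_threshold(F):
    """Right-hand side of the interdeme condition, (1 - F)/(2F)."""
    return (1 - F) / (2 * F)


def interdeme_selection_effective(b, c, N, m):
    """True when (b - c)/c exceeds the threshold implied by N and m."""
    return Fraction(b - c) / c > migration_threshold(equilibrium_F(N, m))


N, m = 100, Fraction(1, 100)
F = equilibrium_F(N, m)
print(F, migration_threshold(F), 2 * N * m)  # 1/5 2 2
```

Because the threshold grows in proportion to Nm, even modest migration requires a large group benefit relative to the individual cost.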


Natural selection can take place in many different ways. The simplest case is 
that in a haploid organism in which the relative fitnesses of the alternative 
genotypes are constant. Models of discrete generations and of continuous 
exponential growth are presented. In the discrete model, the relative fitnesses 
are called Darwinian fitnesses; in the continuous model, they are called 
Malthusian fitnesses. The relationship is that ln w = m, where w and m are the 
Darwinian and Malthusian fitnesses, respectively. 
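
The correspondence between the two kinds of fitness can be illustrated numerically. In this minimal sketch, w is assumed to be the Darwinian fitness of one haploid genotype relative to the other, and m the corresponding difference in Malthusian parameters; the function names and the value w = 1.05 are illustrative only.

```python
import math


def discrete_ratio(ratio0, w, generations):
    """Ratio p/q after a number of discrete generations with relative fitness w."""
    ratio = ratio0
    for _ in range(generations):
        ratio *= w  # each generation multiplies the ratio by w
    return ratio


def continuous_ratio(ratio0, m, time):
    """Ratio p/q at a given time under continuous exponential growth."""
    return ratio0 * math.exp(m * time)


w = 1.05
m = math.log(w)  # the relationship ln w = m
print(math.isclose(discrete_ratio(1.0, w, 20), continuous_ratio(1.0, m, 20)))  # True
```

At whole-generation time points the two models give the same ratio when m is set to ln w.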

In a diploid organism, continuous population growth is difficult to model 
when the genotypes differ in their rates of reproduction. The "standard" 
diploid model is that of discrete generations in which the genotypes may dif- 
fer in the probability of survival from fertilization to adulthood (survivorship 
or viability selection) but are equal in fertility. In such a model with two alle- 
les, A and a, and constant fitnesses of the diploid genotypes, four outcomes of 
selection are possible: A becomes fixed; a becomes fixed; there is a globally 
stable equilibrium; or there is an unstable equilibrium. Fixation of A or a 
results from directional selection in which either AA or aa is favored and the 
fitness of the heterozygous genotype is intermediate between the homozy- 
gous genotypes (or possibly equal to one of them). The stable equilibrium 
results from heterozygote superiority (overdominance), in which the fitness 
of the heterozygous genotype exceeds that of both homozygous genotypes. 
At the stable equilibrium of allele frequency, the average fitness in the population 
$\bar{w}$ is maximized. An unstable equilibrium arises when the fitness of the 
heterozygous genotype is smaller than that of both homozygous genotypes. 
The outcome of selection then depends on the initial conditions; fixation of 
either A or a takes place according to whether the initial frequency of A is 
greater than or less than the unstable equilibrium frequency. 
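
These outcomes can be reproduced by iterating the standard recursion for viability selection under random mating, p' = p(p w11 + q w12)/w-bar. The sketch below is illustrative only: the function names are assumptions, the overdominant fitnesses are those of Problem 4, and the underdominant fitnesses are made-up values.

```python
def mean_fitness(p, w11, w12, w22):
    """Average fitness under random mating at frequency p of allele A."""
    q = 1 - p
    return p * p * w11 + 2 * p * q * w12 + q * q * w22


def next_p(p, w11, w12, w22):
    """One generation of viability selection in the standard diploid model."""
    q = 1 - p
    return p * (p * w11 + q * w12) / mean_fitness(p, w11, w12, w22)


def iterate(p, fitnesses, generations):
    """Apply next_p repeatedly and return the final frequency of A."""
    for _ in range(generations):
        p = next_p(p, *fitnesses)
    return p


overdominant = (0.9, 1.0, 0.6)   # heterozygote fittest: stable interior equilibrium
underdominant = (1.0, 0.6, 0.9)  # heterozygote least fit: unstable interior equilibrium

print(round(iterate(0.1, overdominant, 500), 4))   # 0.8
print(round(iterate(0.5, underdominant, 500), 4))  # 1.0
print(round(iterate(0.4, underdominant, 500), 4))  # 0.0
```

With the underdominant values the unstable equilibrium lies at 3/7, so starting frequencies of 0.5 and 0.4 lead to opposite fixations.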

Mutation-selection balance refers to the maintenance of a harmful allele in 
a population at a low equilibrium frequency because, in every generation, the 
elimination of preexisting harmful alleles by selection is offset by the introduction 
of new harmful alleles by mutation. For a completely recessive allele in 
which the relative fitness of the homozygous recessive genotype is 
1 - s, the equilibrium frequency of the harmful allele is given by $q = \sqrt{u/s}$, 
where u is the rate of mutation per generation of the wildtype allele to the 
harmful allele. For a partially dominant allele, the relative fitnesses of the heterozygous 
and homozygous genotypes carrying the harmful allele are 1 - hs 
and 1 - s, where h is the degree of dominance. In this case, the equilibrium 
allele frequency is given approximately by $q = u/(hs)$. An important implication 
of these formulas is that a small degree of dominance of a "recessive" 
allele has a disproportionate effect in decreasing the allele frequency at equilibrium. 
Another important implication is the Haldane-Muller principle, 
which states that, at mutation-selection balance, the total genetic load (measured 
as the product of the genotype frequency times the decrease in fitness 
of the genotype) is independent of the fitnesses and depends only on the rate 
of recurrent mutation. 
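
The two equilibrium formulas and the Haldane-Muller principle can be illustrated with a short calculation. This is a sketch only: the function names are assumptions, the parameter values are borrowed from Problem 12, and the load for the partially dominant case counts heterozygotes only, which is an approximation valid when the allele is rare.

```python
import math


def q_recessive(u, s):
    """Equilibrium frequency of a completely recessive harmful allele, sqrt(u/s)."""
    return math.sqrt(u / s)


def q_partial_dominance(u, h, s):
    """Approximate equilibrium frequency with partial dominance, u/(hs)."""
    return u / (h * s)


def load_recessive(u, s):
    """Genetic load from homozygotes: frequency q^2 times fitness decrease s."""
    q = q_recessive(u, s)
    return q * q * s


def load_partial_dominance(u, h, s):
    """Genetic load from heterozygotes: frequency 2pq times fitness decrease hs."""
    q = q_partial_dominance(u, h, s)
    return 2 * (1 - q) * q * h * s


u, s, h = 4e-6, 0.2, 0.05
print(q_recessive(u, s), q_partial_dominance(u, h, s))
print(load_recessive(u, s), load_partial_dominance(u, h, s))
```

The recessive load comes out equal to u and the partially dominant load close to 2u, whatever values of s and h are supplied.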

In nature, selection must often be expected to have a more complex mech- 
anism than that of differential survivorship envisaged in the standard model. 
Among the more complex types of selection are frequency-dependent selec- 
tion, density-dependent selection, fecundity selection, selection in age-struc- 
tured populations, selection when there are heterogeneous environments, 
diversifying selection favoring rare alleles or genotypes, differential selection 
in the sexes, selection for X-linked genes, gametic selection, meiotic drive 
(non-Mendelian segregation), multiple-alleles selection, multiple-loci selec- 
tion, and sexual selection. Multiple loci are a particularly important source of 
complexity even when the fitness differences are entirely due to survivor- 
ship. In particular, with strong epistasis and tight linkage, there may be mul- 
tiple stable interior equilibria and the equilibria may not coincide with points 
of maximum average fitness. With weak epistasis and loose linkage, however, 
the average fitness in the population usually does tend to increase. 

Extended concepts of fitness can include the effects of selection acting on 
groups of relatives or on subpopulations. Kin selection invokes the concept of 
inclusive fitness, which embraces not only an organism's own fitness but also 
the fitness of its relatives (exclusive of direct descendants). Kin selection has 
been invoked to explain the evolution of many behavioral traits that appear 
to be detrimental to the individual organism but beneficial to its relatives. 
The most dramatic examples are found in social insects, in which certain 
organisms are reproductively sterile and devote their lives to the care and 
feeding of the queen and the protection of the colony. Generally speaking, 
alleles for altruistic behavior can increase in frequency if the loss in fitness of 
the altruist is offset by the increase in inclusive fitness to the beneficiaries of 
the altruism. More precisely, for additive alleles, the condition for increase in 
frequency of an allele predisposing to altruism is c/b < r, where c and b are 
the fitness cost to the altruist X and benefit to the relative Y, respectively, and 
$r = 2F_{XY}/(1 + F_X)$. 

Interdeme selection plays an important role in the shifting balance theory 
of evolution. According to this theory, adaptive topographies are highly 
complex surfaces with many peaks and valleys. In small, partially isolated 
subpopulations, random genetic drift promotes the random exploration of 
the topography. When, by chance, a subpopulation comes under the control 
of a higher fitness peak, mass selection takes precedence and rapidly multiplies 
the favored gene combinations. Excess migration from the successful 
subpopulation shifts the allele frequencies in surrounding subpopulations 
and, through repetition of the selection process, the favored gene combinations 
progressively spread in waves throughout the entire population. Influential 
as metaphor, the shifting balance theory has not yet been adequately 
evaluated as an accurate description of the principal mechanism of evolutionary 
change. 


1. Suppose that in the tth generation of a haploid population the fitnesses of 
A and a are 1 : $s_t$. Show that $p_n/q_n = (p_0/q_0)(s_0 s_1 s_2 \cdots s_{n-1})^{-1}$. If this is written 
as $p_n/q_n = (p_0/q_0)s^{-n}$, then how can s be interpreted? 

2. If the fitnesses of AA, Aa, aa are 1.0, 0.9, 0.6, and $p_0 = 0.7$, calculate $p_1$, $p_2$, 
and $p_3$, the allele frequencies after 1, 2, and 3 generations of selection. 

3. Calculate the equilibrium allele frequency with overdominance when the 
fitnesses of AA, Aa, and aa are, respectively: 

a. 0.300, 1, 0.700. 

b. 0.930, 1, 0.970. 

c. 0.993, 1, 0.997. 

4. Calculate $\bar{w}$ for $w_{11} = 0.9$, $w_{12} = 1$, $w_{22} = 0.6$, and p = 0.8, assuming random 
mating. Does any other p give a larger $\bar{w}$? Why or why not? 

5. If a rare allele that is lethal when homozygous decreases in frequency by 
1% each generation (i.e., q' = 0.99q), then what is the selection coefficient 
against heterozygotes? (Hint: Assume that q/h is small compared to 1.) 

6. If selection is not too intense, an additive gene giving fitnesses 1 + s, 
1 + s/2, and 1 in AA, Aa, and aa will increase in frequency approximately 
according to $\ln(p_t/q_t) = \ln(p_0/q_0) + (s/2)t$. Calculate the approximate number 
of generations required to evolve significant insecticide resistance in 
an insect population when s = 1/2 and $p_0 = 10^{-5}$. Significant resistance in 
the population may be taken as p, = 10"' Show that, when p,/p « 1, 
$t = (2/s)\ln(p_t/p_0)$. 

7. Show that a random mating diploid population with fitnesses 1, 1 - s, and 
$(1 - s)^2$ for AA, Aa, and aa gives the same change in the allele frequency p 
of A as a haploid population with fitnesses 1 and 1 - s of A and a. 

8. If selection is not too strong, the time required for the allele frequency of 
a favored dominant allele to change from $p_0$ to $p_t$ is given by 

$$\ln(p_t/q_t) + (1/q_t) = [\ln(p_0/q_0) + (1/q_0)] + st$$

Use this equation to derive the analogous equation for a favored 
recessive allele. 

9. The following equation has equilibria at p = 0, 1/2, and 1. Classify the 
equilibria as to stability. If there is a stable equilibrium, is it locally or 
globally stable? 

$$\Delta p = p(1/2 - p)(1 - p)$$

10. Show that the allele frequency of a recessive lethal in generation n is 
given by $q_n = q_0/(1 + nq_0)$. (Hint: It is easiest to derive an expression first 
for $1/q_n$.) How many generations are required to reduce the allele frequency 
by half? 

11. The mutation rate to a dominant gene for neurofibromatosis is approximately 
$9 \times 10^{-5}$ and the reproductive fitness of affected individuals is estimated 
as 1/2. What is the expected equilibrium frequency of affected 
individuals at birth? 

12. What is the equilibrium frequency of a recessive gene arising with a 
mutation rate of $4 \times 10^{-6}$ and a reproductive fitness in homozygotes of 
0.8? What would it be if the gene were partially dominant with h = 0.05? 

13. What is the equilibrium frequency of a recessive gene arising with a 
mutation rate of $10^{-6}$ with a fitness of 0.4 in homozygotes? How much 
would this be reduced if the homozygotes did not reproduce at all? 

14. For a rare allele maintained at an equilibrium frequency of q = u/h, 
where h is the selection coefficient against heterozygotes, show that the 
proportion of heterozygous zygotes resulting from new mutations is 
approximately equal to h. 

15. A polymorphism is said to be protected if all of the fixation states are 
unstable equilibria. Suppose the viabilities of males and females are as 
follows: 

[Table of genotype viabilities in the two sexes, $w_{11}$, $w_{12}$, $w_{22}$ and $v_{11}$, $v_{12}$, $v_{22}$; the numerical entries are not recoverable.] 

What is the smallest value of $v_{12}$ that ensures a protected polymorphism? 
(Hint: Some algebra shows that a condition for polymorphism is 
$w_{12}/w_{11} + v_{12}/v_{11} > 2$ and $w_{12}/w_{22} + v_{12}/v_{22} > 2$.) 

16. If allele a is a recessive lethal in zygotes and the relative fitness of A : a 
gametes is 1 - s : 1, then what is the equilibrium allele frequency of a? 
(Hint: The recursion simplifies greatly for the case of a recessive lethal, 
and equilibrium is given by $p = v_1 w_{12}/[(v_1 + v_2)w_{12} - v_1 w_{11}]$.) 

17. In a Drosophila population cage containing a meiotically driven chromosome 
known as segregation distorter, the equilibrium frequency of the 
driven chromosome was approximately 0.125 and the segregation ratio in 
heterozygotes was about k = 0.75 (Hiraizumi, Sandler, and Crow 1960). 
The meiotic drive chromosome is homozygous lethal in both sexes. 
The equilibrium between viability and meiotic drive in this case is 
$p = 2(k - 1)w_{12}/(1 - 2w_{12})$. Use this equation to estimate the approximate value 
of $w_{12}$ consistent with these data. 

18. In a multiple allele system in which each heterozygote is superior to the 
homozygotes for the alleles it contains, why are all alleles not maintained 
by selection? 

19. The viabilities of genotypes A'A', A'A, and AA are 0.5, 1, and 0.7, respectively. 
If the initial frequency of allele A' is 0.05, what will the frequency be 
when the population comes to equilibrium? If a mutation occurs, introducing 
a novel allele A", such that the fitnesses of A"A", A"A', and A"A 
are all 0.8, determine whether this allele will increase in frequency. 

20. Suppose alleles $A_1$, $A_2$, $A_3$, and $A_4$ are additive in their effects, and the 
homozygote fitnesses are $A_1A_1$: 0.8, $A_2A_2$: 0.6, $A_3A_3$: 0.4, and $A_4A_4$: 0.2. 
What are the heterozygote fitnesses? If all alleles are equally frequent, 
what is the mean fitness for this locus? 


Random Genetic Drift 

Random Genetic Drift Binomial Sampling 


Wright-Fisher Model 

For each generation there is an element of chance in the drawing 
of gametes that will unite to form the next generation. Chance 
alone can result in changes in allele frequency, and because the 
allele frequencies do not change in any predetermined way by this sampling 
process, the process is known as random genetic drift. In Chapter 5 we 
looked at some of the basic principles of how random genetic drift affects levels 
of variation in populations, but the subtlety and importance of drift are 
such that we will now devote this chapter to the subject. 


Consider a large population at Hardy-Weinberg equilibrium with alleles A 
and a at equal frequencies p = q = 1/2. In this population, the genotype frequencies 
are 1/4 AA, 1/2 Aa, and 1/4 aa. Suppose four individuals are drawn at 
random from this population to start a colony. It is possible, by chance alone, 
that the sample will consist of 4 AA individuals. (This chance is 
$(1/4)^4 = 1/256$.) Similarly, it is possible that all four will be aa. Any other possible 
sample could have been drawn, and it is not difficult to work out the 
probability for each type of sample. If the colony remains at just four individuals, 
this same kind of random sampling occurs each generation. At each 
generation, there is an opportunity for a large change in gene frequency 
caused purely by this process of sampling. One consequence of drift soon 
becomes clear: eventually the population will have either all A alleles or all 
a alleles. Once the population reaches such a "fixation" state, it is stuck. Only 
new mutations or migrants into the population can reintroduce variation. 

Figure 7.1 The gene frequencies and sampling that occur in the Wright-Fisher 
model. Initially there are N diploid adults with a gene whose frequency 
is $p_0$. The adults make an infinite number of gametes having the same allele 
frequency. From this pool, 2N gametes are drawn at random to constitute the N 
diploid individuals for the next generation. 

In the example above we sampled four diploid individuals each generation. 
For our purposes, this is equivalent to drawing eight gametes at random 
from a pool of gametes. For example, if eight gametes are drawn from a population 
with p = 1/2, there are nine possible outcomes, having 0, 1, 2, 3, . . . , 8 
copies of the A allele and the remaining copies being the a allele. The probability 
of each of the nine possibilities is given by the binomial distribution, 
first introduced in Chapter 1. For the case of fixation, we need to find the 
probability of drawing eight copies of the A allele. Each draw is considered 
independent of the other draws, and each has a chance of 1/2 of yielding an A. 
This means that the probability of drawing eight consecutive A alleles is 
$(1/2)^8 = 1/256$. It is no coincidence that this is the same as the probability of 
drawing four AA genotypes as described above. 

In sampling gametes from a finite population, the sampling process is 
depicted in Figure 7.1. In each generation there are N diploid individuals in 
the population. Regardless of the way fertilization occurs, we can imagine 
the sampling process to be one of sampling with replacement, such that the 
diploid individuals contribute to an essentially infinite gamete pool whose 
allele frequency is the same as the allele frequency in the adults. From this 
infinite gamete pool, 2N gametes are drawn and unite at random to form the 
next generation. Under this kind of sampling process, the distribution of frequencies 
of gametes is expected to be binomial. 
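
The binomial probabilities for the eight-gamete example can be tabulated exactly. The following is a minimal sketch, with an illustrative function name, that assumes independent draws from an effectively infinite gamete pool as described above.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(i, n_gametes, p):
    """Probability that exactly i of n_gametes sampled gametes carry allele A."""
    p = Fraction(p)
    return comb(n_gametes, i) * p**i * (1 - p) ** (n_gametes - i)


# Eight gametes (four diploid individuals) drawn from a pool with p = 1/2.
distribution = [binomial_pmf(i, 8, Fraction(1, 2)) for i in range(9)]

print(distribution[8])    # 1/256
print(distribution[0])    # 1/256
print(sum(distribution))  # 1
```

The two fixation outcomes, zero or eight copies of A, each have probability 1/256, in agreement with the calculation in the text.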

PROBLEM 7.1 Suppose there are a thousand round pea seeds and a 
thousand wrinkled pea seeds in a soup pot. Enumerate all possible 
samples of four seeds drawn from the pot, and calculate the probability 
of each. 

ANSWER The chance of drawing a round seed is 1/2 (as is the chance 
of drawing a wrinkled seed). The chance of drawing four round seeds 
is roughly $(1/2)^4 = 1/16$, since the fraction of round seeds remains fairly 
close to 1/2 even after a few are drawn. The chance of drawing all four 
wrinkled seeds is also 1/16. There are four ways to get three round and 
one wrinkled seed: RRRW, RRWR, RWRR, and WRRR, and each of 
these has chance 1/16. Similarly, there are four ways to get three wrinkled 
and one round seed: WWWR, WWRW, WRWW, and RWWW, 
and again, each of these four possibilities has probability 1/16. Finally, 
we could draw two round and two wrinkled seeds, and such a sample 
could be drawn in any of six possible orders: RRWW, RWRW, RWWR, 
WRRW, WRWR, and WWRR. Each order has chance 1/16, so the total 
chance of getting two round and two wrinkled seeds is 6/16 = 3/8. We 
have exhaustively enumerated all 16 possible samples of four seeds, 
and since each of the 16 possibilities has chance 1/16, the sum of the 
probabilities of the events is 1. This is a check that we have considered 
all possibilities. Note that the binomial distribution (Equation 7.1) 
makes these calculations much easier. 
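
The enumeration in the answer can be reproduced mechanically. This sketch makes the same approximation as the answer, namely that each draw yields a round seed with probability 1/2 regardless of earlier draws; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All ordered samples of four seeds, R for round and W for wrinkled.
orders = ["".join(sample) for sample in product("RW", repeat=4)]
per_order = Fraction(1, 2) ** 4  # each order has chance 1/16

# Group the orders by the number of round seeds they contain.
ways = Counter(order.count("R") for order in orders)
probabilities = {k: ways[k] * per_order for k in sorted(ways)}

print(len(orders))                  # 16
print(probabilities[2])             # 3/8
print(sum(probabilities.values()))  # 1
```

Grouping by the number of round seeds gives 1, 4, 6, 4, and 1 orders, which are the binomial coefficients for four draws.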

To take a specific example, a population of nine diploid organisms arises
from a sample of just 18 gametes, but the gametes can be thought of as being
sampled from an essentially infinite pool of gametes. Because small samples
are frequently not representative, an allele frequency in the sample may
differ from that in the entire pool of gametes. In fact, if the number of
gametes in a sample is represented as 2N (in this example, 2N = 18), the
probability that the sample contains exactly i alleles of type A is the
binomial probability

$$\Pr(i) = \binom{2N}{i} p^{i} q^{2N-i} \qquad (7.1)$$

where $\binom{2N}{i}$ means (2N)!/[i!(2N - i)!]; p and q are, respectively, the
allele frequencies of A and a in the entire pool of gametes (p + q = 1); and i
takes on any integer value between 0 and 2N. The new allele frequency in the
population (call it p') is therefore i/2N because, by definition, the allele
frequency of A equals the number of A alleles (in this case i) divided by the
total (in this case 2N). In the next generation, the sampling process occurs
anew, and the new probability of a prescribed number of A alleles occurring in
the 2N

270 Chapter 7 

gametes is given by the binomial probability above, with p now replaced by
p' and q by 1 - p'. Thus, the allele frequency may change at random from
generation to generation. Computer-generated examples based on random
numbers are shown in Figure 7.2. Each line in Figure 7.2A gives the number of A
alleles in 20 successive generations of random genetic drift in a population of
size N = 9 (so 2N = 18). As you can see, individual populations behave very
erratically. In seven populations, the A allele became fixed (that is, p = 1); in
five populations, A became lost (that is, p = 0). The other eight populations
remained unfixed (A was neither fixed nor lost), but the final allele frequency
among the unfixed populations was as likely to be one value as another.
Figure 7.2B shows the same kind of simulation, except now 2N = 100. With a
larger population size, the rate at which populations go to fixation is
evidently slower. The principal conclusion from Figure 7.2 is that allele
frequencies behave so erratically in any one population that prediction is
virtually impossible.
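
A simulation of this kind is easy to sketch. The Python below is a minimal illustration, not the program used for Figure 7.2: the function name `wright_fisher_trajectory`, the seed, and the starting count of 9 copies are assumptions made for the example.

```python
import random

def wright_fisher_trajectory(two_n, start_count, generations, rng):
    """Return the number of A alleles in each generation under pure drift."""
    counts = [start_count]
    count = start_count
    for _ in range(generations):
        p = count / two_n
        # Binomial sampling: each of the 2N gametes is A with probability p.
        count = sum(1 for _ in range(two_n) if rng.random() < p)
        counts.append(count)
    return counts

rng = random.Random(1)  # arbitrary seed, chosen only to make the run repeatable
trajectories = [wright_fisher_trajectory(18, 9, 20, rng) for _ in range(20)]

fixed = sum(t[-1] == 18 for t in trajectories)
lost = sum(t[-1] == 0 for t in trajectories)
print("fixed:", fixed, "lost:", lost, "still segregating:", 20 - fixed - lost)
```

Different seeds give different tallies of fixed, lost, and segregating populations, which is the erratic behavior the text describes. Changing `two_n` to 100 (with `start_count` 50) should show slower fixation.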

Although changes in allele frequency due to random genetic drift in any 
individual population may defy prediction, the average behavior of allele fre- 
quencies in a large number of populations can be predicted. Consider a large 
number of populations all starting at the same time with the same allele fre- 
quency and same population size N. Each of these populations is assumed 
to undergo drift independently of the other populations. Except for their 
finite size, the subpopulations are assumed to satisfy all the assumptions of 
the Hardy-Weinberg model, with the additional stipulations that (1) the 
number of males and females is equal, and (2) each individual has an equal 
chance of contributing successful gametes to the next generation. The key 
point illustrated in Figure 7.3 is that we can describe how these populations 
change in allele frequency by considering time slices through the graph, and 
tallying a histogram of the counts of populations having each specified allele 
frequency. Initially, the populations will all be close to the starting allele fre- 
quency. As time passes, the populations "drift" apart, and eventually they are 
spread over all possible allele frequencies. Finally, as we will see, each popu- 
lation must go to fixation for one allele or the other. 

The trick in understanding drift is to learn how to deduce the distribu- 
tions of allele frequencies plotted in Figure 7.3. We just described what would 
happen after one generation — the set of populations would have a range of 
allele frequencies as described by the binomial distribution. The binomial 
distribution gives us the probability that a population has allele frequency p' 
after one generation of drift. If we consider 1000 populations all starting at p, 
the binomial distribution gives us the fraction of those populations with 
allele frequency p'. What about the following generation? For each popula- 
tion, one can imagine the whole sampling process as starting over again. The 
population does not remember where it was the previous generation, and so 
the binomial sampling occurs again. But this time, the allele frequency is p',

Figure 7.2 Simulations of the Wright-Fisher model of random genetic drift.
Each line represents a population of size (A) 2N = 18 or (B) 2N = 100, sampled
with replacement as described in the text. An allele frequency of 0.5 implies
9 copies of each allele in (A) and 50 copies of each allele in (B). Note that
the larger population shows smaller oscillations of allele frequency and a
slower rate of fixation.

Figure 7.3 The model of random genetic drift can be seen by imagining a
large collection of populations undergoing the process of repeated sampling. As
the top part of the figure indicates, the populations' allele frequencies change
erratically, and tend to drift apart. At time intervals, a snapshot of the
populations would produce distributions of allele frequencies (horizontal axis:
allele frequency, x) whose variance increases over time.

and this value must be the frequency used in Equation 7.1. Each of the 1000
populations may have a different p' after one generation of drift, so to get the 
second generation, we need to calculate the binomial distribution 1000 times 
and add the values up. Fortunately, R.A. Fisher and Sewall Wright figured 
out an easier way to do this, which is described in the next section. 

An experiment designed along the same lines as the one given in Figure
7.3 is shown in Figure 7.4. In this study, the history of 19 generations of
random genetic drift in 107 subpopulations of Drosophila melanogaster was
followed. Each population was initiated with 16 bw^75/bw (bw = brown eyes)
heterozygotes and maintained at a constant size of 16 individuals by
randomly choosing eight males and eight females to produce the next
generation. Each histogram in Figure 7.4 gives the number of populations
containing 0, 1, 2, . . . , 32 bw^75 alleles. The pattern of change in allele
frequency in Figure 7.4 may at first appear to be complicated, but in reality a
simple thing is happening. The initially humped distribution of allele frequency
gradually becomes flat as populations fixed for bw^75 or bw begin to pile up at
the boundaries. The piling up occurs because, once an allele has been fixed or
Figure 7.4 Random genetic drift in 107 actual populations of Drosophila
melanogaster. Each of the initial 107 populations consisted of 16 bw^75/bw
heterozygotes (N = 16; bw = brown eyes). From among the progeny in each
generation, eight males and eight females were chosen at random to be the
parents of the next generation. The horizontal axis of each curve gives the
number of bw^75 alleles in the population, and the vertical axis gives the
corresponding number of populations. (Data from Buri 1956.)

lost, it remains fixed or lost since mutation is negligible over such a small
number of generations in small populations. After 19 generations, most of the
populations are fixed for one allele or the other, and among the unfixed pop- 
ulations, the distribution of allele frequencies is essentially flat. 

PROBLEM 7.2 Consider a self-pollinating plant population consisting
of a single heterozygous (Aa) individual on a small barren island.
Suppose the plant reproduces and dies, so that the generations are
discrete, and the population can only consist of a single plant. What is
the probability that the population is homozygous at this genetic
locus by the second generation?

ANSWER The chance that the first generation offspring is AA is 1/4
and the chance that it is aa is also 1/4, so the chance of fixation in one
generation is 1/2. If the first generation offspring is Aa, then the
probability of fixation in the second generation (given that the population is
not fixed in the first generation) is again 1/2. The probability of not
fixing in generation 1 and then fixing in generation 2 is 1/2 x 1/2 = 1/4. Add
to this the chance of fixing in one generation and we get 3/4 as the
probability of fixation by two generations. Note that the probability of
not going to fixation each generation is 1/2, and so the chance of not
fixing for two generations is 1/2 x 1/2 = 1/4.
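
The same reasoning extends to any number of generations: the lineage stays heterozygous with probability 1/2 per generation. A small sketch in Python, using exact fractions, might look like this (the function name `prob_fixed_by` is ours and purely illustrative):

```python
from fractions import Fraction

def prob_fixed_by(generation):
    """Probability that a selfing lineage started from one Aa plant is homozygous by `generation`."""
    p_het = Fraction(1)          # generation 0: the single plant is Aa
    for _ in range(generation):
        p_het *= Fraction(1, 2)  # an Aa plant leaves an Aa offspring with chance 1/2
    return 1 - p_het

for t in range(1, 6):
    print(t, prob_fixed_by(t))
```

For t = 1 and t = 2 this reproduces the values 1/2 and 3/4 obtained in the answer.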

Consider an infinitely long bowling alley with minor imperfections that 
displace the ball one way and the other. The gutters represent the fixation 
states of p = 0 and p = 1. Once the ball goes in the gutter, it cannot get out
again. The imperfections keep the ball from rolling in a straight line, and 
eventually it rolls into the gutter. In this analogy, the size of the population 
corresponds to the width of the bowling alley; a larger population implies a 
wider alley. The imperfections still deflect the ball but, in proportion to the 
width of the alley, the ball's zigs and zags are of a smaller magnitude. Conse- 
quently, the ball remains out of the gutter for a longer time, analogous to the 
longer time to fixation for a larger population. But just as certainly, the ball 
will eventually land in the gutter. 


Fisher (1930) and Wright (1931) both considered the consequences of the sort
of binomial sampling that occurs in small populations when the sampling
occurs repeatedly over many generations. This model, known as the
Wright-Fisher model, derives the distribution of allele frequencies among
populations undergoing random genetic drift. Although neither Fisher nor Wright
formulated the problem in terms of matrices, as used here, this approach
makes the problem much simpler and gives the same results. If a population
has 2N genes, and there are two alleles (A and a) that may be segregating,
then the state of the population can be described by the number of A alleles
in the population. The possible states are then 0, 1, 2, . . . , 2N. The states
0 and 2N are special in that these are fixations, and once the population gets
into these states, it cannot leave. The states 0 and 2N are called absorbing
states. From any other allele frequency, it is possible for the population to
drift to a different allele frequency. However, to use an example from Figure
7.4, if 2N = 32, then the chance of drifting in one generation from 30 copies
of gene A to 29 copies of gene A is greater than the chance of drifting to two
copies. The probability of the population drifting from the state having i
copies to j copies of allele A is known as the transition probability. The
transition probability for the Wright-Fisher model is obtained directly from
the binomial distribution. If a population has i copies of allele A, then the
allele frequency is p = i/2N, and the frequency of allele a is q = 1 - i/2N.
The probability of going from i copies of A to j copies of A in one generation is:

$$T_{ij} = \binom{2N}{j} \left(\frac{i}{2N}\right)^{j} \left(1 - \frac{i}{2N}\right)^{2N-j} \qquad (7.2)$$



The transition probabilities can be put in a square matrix T, with elements
T_ij giving the transition probability from state i to state j for
i, j = 0, 1, 2, . . . , 2N. The matrix T contains everything that is needed to
predict the expected distribution of populations like those in Figure 7.4 over
a series of generations. This type of model, expressed in terms of discrete
states with fixed probabilities of going from one state to another, is known as
a Markov chain, and it has some very elegant mathematical properties.
Iterations of the Wright-Fisher model give the expected outcome of a pure drift
process (Figure 7.5). We will use the Wright-Fisher model only to show one
aspect about fixation probabilities.
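
To make the matrix formulation concrete, here is a hedged Python sketch that builds T from the binomial transition probabilities and iterates it for 19 generations with 2N = 32, the setting of Figure 7.4. The helper names `transition_matrix` and `step` are our own, and plain lists are used instead of a matrix library to keep the example self-contained.

```python
from math import comb

def transition_matrix(two_n):
    """T[i][j] = probability of going from i to j copies of A in one generation."""
    matrix = []
    for i in range(two_n + 1):
        p = i / two_n
        row = [comb(two_n, j) * p**j * (1 - p) ** (two_n - j) for j in range(two_n + 1)]
        matrix.append(row)
    return matrix

def step(distribution, matrix):
    """One generation: new[j] = sum over i of distribution[i] * T[i][j]."""
    size = len(distribution)
    return [sum(distribution[i] * matrix[i][j] for i in range(size)) for j in range(size)]

two_n = 32                      # as in Figure 7.4
T = transition_matrix(two_n)
dist = [0.0] * (two_n + 1)
dist[16] = 1.0                  # every population starts with 16 copies of A
for _ in range(19):
    dist = step(dist, T)

print("expected fraction lost: ", round(dist[0], 3))
print("expected fraction fixed:", round(dist[two_n], 3))
```

Rows 0 and 2N of T place all their probability on themselves, which is how the absorbing states appear in the matrix. The vector `dist` is the expected histogram of populations over the states 0, 1, ..., 2N.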

PROBLEM 7.3 Consider a population of four diploid individuals.
Calculate the probability that a population with four copies of allele A
(allele frequency p = 1/2) drifts in one generation to having three
copies. What is the probability that the population has four copies of
A? Five copies? Now consider a population of the same size, but
initially with two copies of A. What is its probability of drifting to one,
two, or three copies?

ANSWER Applying Equation 7.2, we get T_43 = [8!/(5!3!)](1/2)^8 = 7/32
= 0.219, T_44 = [8!/(4!4!)](1/2)^8 = 70/256 = 0.273, and T_45 = T_43 = 0.219.
(Note that the binomial distribution is symmetric when p = 1/2, so there is
equal probability for samples that are symmetrically divergent from
p = 1/2.) In the case when the initial frequency is 2/8, we get
T_21 = [8!/(1!7!)](1/4)(3/4)^7 = 0.267, T_22 = [8!/(2!6!)](1/4)^2(3/4)^6 = 0.311,
and T_23 = [8!/(3!5!)](1/4)^3(3/4)^5 = 0.208.
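
These transition probabilities can be recomputed in a few lines. The sketch below assumes the binomial form of the transition probability used in this answer; `transition_prob` is an illustrative name.

```python
from math import comb

def transition_prob(i, j, two_n):
    """Chance of going from i to j copies of A in one generation of binomial sampling."""
    p = i / two_n
    return comb(two_n, j) * p**j * (1 - p) ** (two_n - j)

# Four diploid individuals, so 2N = 8.
for i, j in [(4, 3), (4, 4), (4, 5), (2, 1), (2, 2), (2, 3)]:
    print(f"T_{i}{j} = {transition_prob(i, j, 8):.4f}")
```

The printed values agree with those in the answer up to rounding in the last digit.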

The above problem illustrates an important feature of the Wright-Fisher
model. The magnitude of change in allele frequency is greater when the allele
frequency is 1/2 than it is when the allele frequency is more skewed. The
changes are greater because the variance in the binomial sampling
distribution is greatest when p = 1/2. (The formula for the binomial variance
is pq/2N.) The variance drops to zero at p = 0 and p = 1. The variance formula
makes it clear that a large population will change allele frequency more slowly
than a smaller population because the sampling variance varies as the
reciprocal of
population size. Furthermore, drift has no preferred direction: the expected
increase in allele frequency is the same as the expected decrease, regardless
of the allele frequency. The process of drift does not recognize when a
population is close to fixation. Averaged over all possible samples, the
change in frequency is zero, regardless of the current population allele
frequency.
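
The mean and variance of the one-generation change can be computed exactly from the binomial distribution. The following Python sketch does so for a few illustrative cases; the function name `drift_moments` and the chosen values of 2N and i are assumptions made for the example.

```python
from math import comb

def drift_moments(i, two_n):
    """Exact mean and variance of the one-generation change in allele frequency."""
    p = i / two_n
    probs = [comb(two_n, j) * p**j * (1 - p) ** (two_n - j) for j in range(two_n + 1)]
    changes = [j / two_n - p for j in range(two_n + 1)]
    mean = sum(pr * d for pr, d in zip(probs, changes))
    variance = sum(pr * d * d for pr, d in zip(probs, changes)) - mean**2
    return mean, variance

for two_n, i in [(18, 9), (18, 2), (100, 50), (100, 10)]:
    p = i / two_n
    mean, variance = drift_moments(i, two_n)
    print(f"2N={two_n:3d} p={p:.2f} mean={mean:+.1e} "
          f"var={variance:.5f} pq/2N={p * (1 - p) / two_n:.5f}")
```

In each case the mean change is zero to within floating-point error and the variance matches pq/2N, largest at p = 1/2 and smaller in the larger population.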

Fisher and Wright also addressed the expected time to fixation. Since 
another approach yields the solution to this problem much more easily, we 
will consider times to fixation in the next section. 

PROBLEM 7.4 Simulating random drift can be a very time-consuming
proposition. If one wants to simulate a population of 1000 individuals
for 1000 generations, one has to draw 10^6 random numbers
and for each decide whether to accept or reject each genotype. Kimura
(1980b) came up with a shortcut that relates very closely to how the
diffusion approximation works (see the next section). The trick is to
use the recursion $p' = p + (2U - 1)\sqrt{3pq/2N}$, where U is a random
number uniformly distributed between 0 and 1. Each generation, one
picks one random number U, and a sample realization of the next
generation's allele frequency is obtained from the above recursion. Why
does this approach work? (Hint: The variance in a uniform distribution
is the square of the range divided by 12.)

ANSWER The expression 2U - 1, where U is a number between 0
and 1, gives a value from -1 to +1, or a range of 2. The range of
$(2U - 1)\sqrt{3pq/2N}$ is therefore $2\sqrt{3pq/2N}$. Squaring this expression
and dividing by 12, the variance of this uniform random variable is
thus pq/2N, just what we get from a binomial sampling distribution.
Each generation the allele frequency has an equal chance of increasing
or decreasing, and the variance in the allele frequency change is
pq/2N. Even though the distribution of change in allele frequency is
uniform in the pseudosampling simulation instead of binomial (as it
is in the Wright-Fisher model), this process can reproduce most of the
results of the complete brute-force simulation at a tiny fraction of the
computer time.
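
A minimal sketch of this pseudosampling recursion in Python follows. It is our own illustration rather than Kimura's program. In particular, clamping the frequency to the interval from 0 to 1 is an assumption added here, since the recursion as stated can overshoot the boundaries, and the names and seed are arbitrary.

```python
import math
import random

def pseudo_sampling_step(p, n, rng):
    """One generation of the recursion p' = p + (2U - 1) * sqrt(3pq / 2N)."""
    u = rng.random()  # U is uniform on [0, 1)
    p_next = p + (2 * u - 1) * math.sqrt(3 * p * (1 - p) / (2 * n))
    # Assumption (not in the text): overshoots are clamped to the absorbing boundaries.
    return min(1.0, max(0.0, p_next))

rng = random.Random(7)  # arbitrary seed
p, n = 0.5, 50
changes = [pseudo_sampling_step(p, n, rng) - p for _ in range(20000)]
mean_change = sum(changes) / len(changes)
var_change = sum((c - mean_change) ** 2 for c in changes) / len(changes)

print("simulated variance of change:", var_change)
print("binomial variance pq/2N:     ", p * (1 - p) / (2 * n))
```

Only one random number is drawn per generation, instead of 2N, which is the source of the savings in computer time.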


The pattern of change in allele frequency shown in Figure 7.4 is very nearly
that expected theoretically for an ideal population, and although the
full-blown theory of random genetic drift requires mathematics beyond the scope
of this book, some background might be of interest (see Kimura 1955, 1964,
1976; Wright 1969; Crow and Kimura 1970; Kimura and Ohta 1971). The
representation of random drift by a differential equation was first applied by
Fisher (1922), who noted that the equation describing the diffusion of heat
through a solid bar applies to random genetic drift. The distribution of
populations with allele frequencies ranging from 0 to 1 is called φ(x,t), where
x represents the allele frequency and t indicates time. Figure 7.5 shows a
particular realization of φ(x,t) changing through time. The theoretical problem
is to formulate an equation that describes how φ(x,t) changes under random
genetic drift, and to solve the equation.

The parallel between the physical process of diffusion and what is actually
going on in a finite population is a bit abstract, but it is not particularly
difficult. Consider an axis of allele frequency extending from 0 to 1. The
number of populations whose frequency is between x and x + δx at time t is our
probability density φ(x,t). Populations may enter this range of allele
frequencies by drifting in from a lower frequency, which occurs with a
probability flux J(x,t). Populations may leave this range of allele frequencies
by drifting out, which occurs with probability flux J(x + δx,t). The rate of
change in φ(x,t) is the difference in these fluxes, which we can write

Figure 7.5 Solution of the Wright-Fisher model for the distribution φ(x,t)
(see Figure 7.12).

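$$\frac{\partial \phi(x,t)}{\partial t} = -\frac{\partial}{\partial x} J(x,t) \qquad (7.3)$$

because $[J(x,t) - J(x + \delta x,t)]/\delta x \approx -\frac{\partial}{\partial x} J(x,t)$ when δx is very small.
The probability flux is

$$J(x,t) = M(x)\phi(x,t) - \frac{1}{2}\frac{\partial}{\partial x}\left[V(x)\phi(x,t)\right] \qquad (7.4)$$

where M(x) is the average change in allele frequency in a population whose
current allele frequency is x, and V(x) is the variance in change in allele
frequency.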


M(x) is zero unless there is some force, like mutation or selection, driving the
allele frequency to change in a particular direction. (Remember that with pure
drift, allele frequency increases and decreases with equal chance.) V(x) tells
how fast allele frequencies change for a population with frequency x. Under the
Wright-Fisher model, V(x) = x(1 - x)/2N, which means that the binomial
sampling variance describes the magnitude of allele frequency change. But the
probability flux depends on the difference in rates of change from x to x + δx,
and so, just as in classical physical models of diffusion, the flux depends on
the gradient in whatever is diffusing. In the case of chemical diffusion, this
gradient would be the gradient in concentrations, and a greater difference in
concentrations would yield higher flux. In the case of our population genetic
model, the gradient is the change in sampling variance as x is varied, or
dV(x)/dx. Substituting Equation 7.4 into Equation 7.3, we get


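$$\begin{aligned}
\frac{\partial \phi(x,t)}{\partial t} &= -\frac{\partial}{\partial x}\left\{ M(x)\phi(x,t) - \frac{1}{2}\frac{\partial}{\partial x}\left[V(x)\phi(x,t)\right] \right\} \\
&= -\frac{\partial}{\partial x}\left[M(x)\phi(x,t)\right] + \frac{1}{2}\frac{\partial^{2}}{\partial x^{2}}\left[V(x)\phi(x,t)\right] \qquad (7.5)
\end{aligned}$$

Equation 7.5 is known as the diffusion equation, the forward Kolmogorov
equation, or, in the context of the physics of heat diffusion, the Fokker-Planck
equation. For the Wright-Fisher model, M(x) = 0 and V(x) = x(1 - x)/2N, so
we get

$$\frac{\partial \phi(x,t)}{\partial t} = \frac{1}{4N}\frac{\partial^{2}}{\partial x^{2}}\left[x(1 - x)\phi(x,t)\right] \qquad (7.6)$$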


Many aspects of this problem were explored by Wright (1931), and the
formal solution to this equation, found by Kimura (1955), required some
heavy mathematics. For our purposes, some graphs will illustrate the
important properties of the diffusion equation. The two families of curves in
Figure 7.6 are the theoretical distributions φ(x,t) of allele frequency among
unfixed populations after various times (t) measured in units of N generations.
In Figure 7.6A, all populations have an initial allele frequency of 1/2, as in
the actual populations in Figure 7.4; after about t = 2N generations, the
distribution of allele frequency is essentially flat, and by this time about
half the populations are still unfixed. The distributions in Figure 7.6 refer
only to those populations that are unfixed; as time goes on, more and more of
the populations become fixed, and the distributions progressively pile up at 0
and 1, as in the histograms in Figure 7.4. Indeed, in Figure 7.6, the area
under each curve is equal to the proportion of unfixed populations, which
becomes progressively smaller. In particular, the rate at which the height of
the distribution decreases once it becomes flat is about 1/2N per generation.
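
The 1/2N decay rate can be checked numerically without solving the diffusion equation, by iterating the exact binomial (Wright-Fisher) transition probabilities until the distribution has settled. The Python sketch below is an illustration under assumed settings (2N = 16, 200 generations, starting frequency 1/2); the helper names are ours.

```python
from math import comb

def transition_matrix(two_n):
    """Binomial transition probabilities between allele counts 0..2N."""
    rows = []
    for i in range(two_n + 1):
        p = i / two_n
        rows.append([comb(two_n, j) * p**j * (1 - p) ** (two_n - j)
                     for j in range(two_n + 1)])
    return rows

def step(distribution, matrix):
    """Advance the distribution of populations by one generation."""
    size = len(distribution)
    return [sum(distribution[i] * matrix[i][j] for i in range(size)) for j in range(size)]

two_n = 16
T = transition_matrix(two_n)
dist = [0.0] * (two_n + 1)
dist[two_n // 2] = 1.0            # all populations start at frequency 1/2

unfixed = []
for _ in range(200):
    dist = step(dist, T)
    unfixed.append(sum(dist[1:two_n]))   # proportion still segregating

ratio = unfixed[-1] / unfixed[-2]
print("late per-generation ratio of unfixed proportion:", ratio)
print("1 - 1/2N:                                       ", 1 - 1 / two_n)
```

Once the transient has died away, the proportion of unfixed populations shrinks by a factor very close to 1 - 1/2N each generation, in line with the statement above.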

Figure 7.6 Theoretical results of random genetic drift. (A) Initial allele
frequency = 1/2. (B) Initial allele frequency = 0.1. The horizontal axis of
each panel is allele frequency. The curves have been scaled so that the area
under each curve is equal to the proportion of populations in which fixation
or loss has not yet occurred. The curves are therefore the distributions of
allele frequencies among segregating populations. (From Kimura 1955.)

Figure 7.6B shows what happens when the initial allele frequency is 0.1;
here the distributions are highly asymmetrical, and the distribution of allele
frequency does not become flat until about t = 4N generations, by which time
only about 10% of the populations remain unfixed. Once a flat distribution of
allele frequency is reached, the distribution remains flat, but random drift
continues until fixation or loss has occurred in all populations.

PROBLEM 7.5 Demonstrate from Equation 7.6 that the fixation
states, x = 0 and x = 1, are equilibria of the diffusion process.

ANSWER A condition for equilibrium is that φ(x,t) remains stationary,
so that $\partial \phi(x,t)/\partial t = 0$. Substituting into Equation 7.6, we get

$$0 = \frac{1}{4N}\frac{\partial^{2}}{\partial x^{2}}\left[x(1 - x)\phi(x,t)\right]$$

If x = 0 or x = 1, this equality is clearly satisfied, because x(1 - x) = 0.
It takes a bit more work to show that these are the only equilibria.

To illustrate that the diffusion approximation and the Wright-Fisher
model give similar results, Figure 7.7 shows the diffusion approximation
for the data in Figure 7.4, with 2N = 32, p = 1/2, and t running from generation
1 through generation 19.

Figure 7.7 Solution to the diffusion equation for a particular case. This is
the three-dimensional view of Figure 7.6, and represents the diffusion
approximation to the exact solution obtained by the Wright-Fisher model in
Figure 7.5.

Absorption Time and Time to Fixation 

One useful application of the diffusion approximation has been to determine 
expressions for the expected time for a neutral allele to go to fixation. Assum- 
ing that the allele starts at frequency p, Kimura and Ohta (1969) showed that 
the mean time (in generations) until the allele is fixed (ignoring cases where 
the allele is lost) is 

$$\bar{t}_1(p) = -\frac{4N}{p}\left[(1 - p)\log(1 - p)\right] \qquad (7.7)$$

Similarly, they showed that the mean time to loss of the allele is

$$\bar{t}_0(p) = -\frac{4N}{1 - p}\left[p\log(p)\right] \qquad (7.8)$$

Combining Equations 7.7 and 7.8, the mean persistence time of an allele is

$$\bar{t}(p) = -4N\left[p\log(p) + (1 - p)\log(1 - p)\right] \qquad (7.9)$$

where $\bar{t}(p)$ is the average time that an allele remains segregating in a
population (that is, until its frequency is either 0 or 1).
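
As a worked illustration of these expressions, the Python sketch below evaluates the mean times to fixation, to loss, and of persistence for a few starting frequencies. It assumes that log denotes the natural logarithm and that the persistence time is the average of the other two weighted by p and 1 - p; the function names are ours.

```python
import math

def mean_time_to_fixation(p, n):
    """Mean generations until fixation, given that the allele eventually fixes."""
    return -4 * n * (1 - p) * math.log(1 - p) / p

def mean_time_to_loss(p, n):
    """Mean generations until loss, given that the allele is eventually lost."""
    return -4 * n * p * math.log(p) / (1 - p)

def mean_persistence_time(p, n):
    """Mean generations that the allele remains segregating."""
    return -4 * n * (p * math.log(p) + (1 - p) * math.log(1 - p))

n = 1  # with N = 1 the results read directly as multiples of N
for p in (0.1, 0.5, 0.9):
    print(f"p={p}: fixation {mean_time_to_fixation(p, n):.2f}N, "
          f"loss {mean_time_to_loss(p, n):.2f}N, "
          f"persistence {mean_persistence_time(p, n):.2f}N")
```

Weighting the fixation time by p (the chance of fixation) and the loss time by 1 - p recovers the persistence time, which is one way to see how the combined expression arises.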

The average times for fixation, loss, and persistence of a neutral allele
are shown graphically in Figure 7.8. An allele is expected to remain in a
population for the longest time when its initial frequency is 1/2. When
p_0 = 1/2, the average time that a population remains unfixed is about 2.77N
generations.

Figure 7.8 Average persistence of a neutral allele in an ideal diploid
population of size N, plotted against initial allele frequency.


Consider a set of four subpopulations each started with allele frequency p =
1/2, and each undergoing random drift independently following binomial
sampling (Figure 7.9). Within any particular subpopulation (call it subpopulation


Figure 7.9 A schematic diagram showing a set of four populations undergoing
the process of drift (panel titles: after 1.39N generations; after fixation).
Initially the allele frequency is 1/2 in all four populations, and the average
heterozygosity is 1/2. As the populations drift in allele frequency, the
average is expected to remain the same (indicated by p remaining 1/2) but the
average heterozygosity decreases. (Genotype frequencies are given for the
intermediate generation.) Finally, all populations go to fixation, half fix one
allele and half fix the other, so the average allele frequency is still 1/2,
but the heterozygosity is zero.

number i), mating is random because all the assumptions in Table 7.1 hold
true. If the allele frequencies of A and a in the ith subpopulation are denoted
p_i and q_i, then the genotype frequencies of AA, Aa, and aa are given by the
familiar Hardy-Weinberg principle as p_i^2, 2p_i q_i, and q_i^2. Furthermore,
picture the situation in Figure 7.9 at a time so advanced that all
subpopulations are fixed for one allele or the other. Within the ith
subpopulation, therefore, either p_i equals 0 or p_i equals 1. The genotype
frequencies of AA, Aa, and aa in that subpopulation are either 0, 0, and 1
(if p_i = 0), or 1, 0, and 0 (if p_i = 1). These genotype frequencies, though
extreme, still satisfy the Hardy-Weinberg principle. Thus, within any one
subpopulation in Figure 7.9, the frequency of heterozygotes is that expected
with random mating.

The situation regarding the total population in Figure 7.9 is very different, 
however, as there is an overall deficiency of heterozygotes. Suppose that we 
sample the four subpopulations, but that we are unaware of the existence of 
the four subpopulations, and instead we think that the sample contains a sin- 
gle randomly mating population. Considering the four populations at the 
right side of Figure 7.9 (after all variation is lost), if we were to calculate the 
allele frequency we would obtain p = 1/2. We would then naively expect a
fraction 2pq = 1/2 of the genotypes to be heterozygous. In fact, we would have no
heterozygotes at all in our sample! This rather paradoxical result, that there
is a deficiency of heterozygotes in the total population even though random
mating occurs within each subpopulation, is a consequence of the random
genetic drift of allele frequencies among subpopulations due to their finite
size. This extreme case when each subpopulation is fixed is easy to
understand: a population with allele frequency 1/2 could only be made up of two
subpopulations each fixed for A, and two subpopulations each fixed for a.
The entire population has no heterozygotes whatsoever, but the average
allele frequency is 1/2. The total population has a deficiency of heterozygotes,
much as if there were inbreeding. This inbreeding-like effect of population 
subdivision is known as the Wahlund principle (Chapter 4), and we are now 


Table 7.1 Assumptions
(1) Diploid organism
(2) Sexual reproduction
(3) Nonoverlapping generations
(4) Many independent subpopulations, each of constant size N
(5) Random mating within each subpopulation
(6) No migration between subpopulations
(7) No mutation
(8) No selection

Figure 7.10 Diagram illustrating the reasoning behind the recursion for F in a
finite population. When the gametes are drawn to make up the population at
generation t, there is a chance 1/2N that any pair of alleles will be drawn
from the same allele in generation t - 1. If this happens, the probability of
identity is 1 (F = 1). For the allele pairs drawn in generation t from two
distinct alleles at generation t - 1 (the probability of this is 1 - 1/2N),
the probability of identity is F_{t-1}. Adding the probabilities of these two
events, we get F_t = 1/2N + (1 - 1/2N)F_{t-1}.

in a position to quantify the manner in which subpopulations diverge in
allele frequency under random genetic drift.

In Chapter 4 we measured the extent of inbreeding with the inbreeding
coefficient, F. F is the probability of autozygosity, or the probability that an
individual carries a pair of alleles that are identical by descent (derived from
a common ancestor). Even though random mating occurs within each
subpopulation in Figure 7.9, because gametes do combine at random, any two
alleles in a subpopulation may be identical by descent due to the limited
population size. Thus F_t does not equal zero. The value of F_t can be
calculated as in Figure 7.10. This figure shows the 2N alleles in a breeding
population of generation t - 1. In sampling alleles for generation t, the first
chosen allele may be any of those present in generation t - 1 with equal
chance. The probability that the second chosen allele is of the same type as
the first is 1/2N, because this is the frequency of each allelic type in the
gametic pool; the probability that the second chosen allele is of a different
type from the first is accordingly 1 - 1/2N. In the first case, the probability
of identity-by-descent
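is 1; in the second case it is F_{t-1}. Altogether the recursion is

$$F_t = \frac{1}{2N} + \left(1 - \frac{1}{2N}\right)F_{t-1} \qquad (7.10)$$

Multiplying both sides by -1 and adding 1, and then applying the result
repeatedly back to generation 0, leads to

$$1 - F_t = \left(1 - \frac{1}{2N}\right)(1 - F_{t-1}) = \left(1 - \frac{1}{2N}\right)^{t}(1 - F_0) \qquad (7.11)$$

or, when F_0 = 0,

$$F_t = 1 - \left(1 - \frac{1}{2N}\right)^{t} \qquad (7.12)$$

Figure 7.11 Increase of F_t in ideal populations as a function of time and
effective population size N. (Horizontal axis: generations, t.)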
Figure 7.11 shows the rapid increase of F_t in small populations. Another
aspect of the same phenomenon can be appreciated by the probability of
drawing a pair of alleles that are not identical by descent. This probability is
the same as the heterozygosity, and it can be written

$$H_t = 1 - F_t \qquad (7.13)$$

By substitution for F_t we obtain the rate of change in heterozygosity from
random genetic drift

$$H_t = \left(1 - \frac{1}{2N}\right)H_{t-1} \qquad (7.14)$$

and so

$$H_t = \left(1 - \frac{1}{2N}\right)^{t} H_0 \approx H_0 e^{-t/2N} \qquad (7.15)$$
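
The recursion for F and the geometric decline of heterozygosity can be followed numerically. The sketch below is illustrative only: it assumes the recursion F_t = 1/2N + (1 - 1/2N)F_{t-1} with F_0 = 0, takes N = 16 as in the Drosophila experiment, and uses names of our own choosing.

```python
import math

def inbreeding_series(n, generations, f0=0.0):
    """Iterate F_t = 1/2N + (1 - 1/2N) * F_{t-1} and return [F_0, F_1, ..., F_t]."""
    f = f0
    series = [f]
    for _ in range(generations):
        f = 1 / (2 * n) + (1 - 1 / (2 * n)) * f
        series.append(f)
    return series

n = 16
f_series = inbreeding_series(n, 19)
for t in (1, 5, 10, 19):
    closed_form = 1 - (1 - 1 / (2 * n)) ** t   # closed form when F_0 = 0
    h_ratio = 1 - f_series[t]                  # H_t relative to H_0
    approx = math.exp(-t / (2 * n))            # exponential approximation
    print(f"t={t:2d} F_t={f_series[t]:.4f} closed form={closed_form:.4f} "
          f"H_t/H_0={h_ratio:.4f} exp(-t/2N)={approx:.4f}")
```

The iterated and closed-form values of F_t coincide, and the exponential approximation to the decline in heterozygosity is already close for N = 16.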


Recall again that a single population undergoing random drift remains
in approximate Hardy-Weinberg proportions, and that the symbol H_t
represents a sort of "virtual heterozygosity" averaged across many
subpopulations. The above equations show that pure random drift should result
in the heterozygosity decreasing at a geometric rate, since H_t is multiplied
by the constant (1 - 1/2N) each generation. Experimental tests of this
prediction are shown in Figure 7.12. Figure 7.12A shows how the heterozygosity
averaged across the populations in Figure 7.4 declines over generations,
but the theoretical curve when N = 16 does not fit the data very well.
In fact, the rate of decline of heterozygosity is greater than the theoretical
expectation, as though the population size were smaller than N = 16. On
the other hand, the allele frequency, averaged across populations, is not
expected to change, and the data agree with this aspect of the theory quite
well (Figure 7.12B).

PROBLEM 7.6 Use Equation 7.15 to determine how long it takes for 
a finite population to halve in heterozygosity. 

ANSWER Set (1/2)H_0 = H_0 e^{-t/2N}. Dividing out the H_0 and taking
logarithms, we obtain

$$\ln\left(\tfrac{1}{2}\right) = -\frac{t}{2N}$$

so t = 1.39N. In other words, it takes 1.39N generations to halve the
heterozygosity, regardless of its initial value. Fisher expressed this
result by saying that it takes 1.39N generations to halve the genic
variance in the population. Since the variance of a binomial sample is
pq/2N, and the variance in allele frequency among subpopulations is
proportional to the heterozygosity, it follows that both the variance
and the heterozygosity decrease at the same rate.
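
The halving time is quick to evaluate, and it can be compared with the value obtained from the geometric decline (1 - 1/2N)^t without the exponential approximation. The function names in this Python sketch are illustrative.

```python
import math

def halving_time_approx(n):
    """Solve 1/2 = exp(-t / 2N) for t, giving t = 2N ln 2."""
    return 2 * n * math.log(2)

def halving_time_exact(n):
    """Solve 1/2 = (1 - 1/2N)**t for t, without the exponential approximation."""
    return math.log(0.5) / math.log(1 - 1 / (2 * n))

for n in (9, 16, 100):
    print(f"N={n}: approx {halving_time_approx(n):.2f}, exact {halving_time_exact(n):.2f} generations")
```

For N = 16 both versions give a halving time of roughly 22 generations, and the two agree more closely as N grows.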

Figure 7.12 Theoretical curves for average heterozygosity (A) with N = 9 or
N = 16, along with actual values (plotted as points) from the experiment in
Figure 7.4. In (B) the observed and expected allele frequencies (averaged
across the 107 subpopulations) are plotted. The horizontal axis of each panel
is generation (t). (Data from Buri 1956.)

Several important consequences of the population structure in Figure 7.9
can now be summarized. First, although each subpopulation is finite in size,
we can imagine so many of them that the size of the total population is
effectively infinite. For an infinite population that obeys the assumptions in
Table 7.1, the allele frequencies must remain constant. That is, even though the
allele frequency in any individual subpopulation may change willy-nilly due
to random genetic drift, the overall average allele frequency of A among
subpopulations remains $p_0$, where $p_0$ represents the allele frequency of A in
the base population. Figure 7.12B gives an experimental demonstration of the
constancy of average allele frequency. Since $F_t$ is the probability of
autozygosity of a gene in an individual in generation t, the probability of
allozygosity (obtaining a pair of alleles that are not identical by descent) is
$1 - F_t$. Because $p_0$ is the overall allele frequency of A, the probability
that a randomly chosen individual will be genotypically AA is $p_0^2(1 - F_t)$
[for the case of allozygosity] + $p_0 F_t$ [for the case of autozygosity].
Similarly, the probability that the individual will be Aa equals
$2p_0q_0(1 - F_t)$, and the probability that the individual will be aa equals
$q_0^2(1 - F_t) + q_0 F_t$. Note that the genotypic frequencies in the total
population are different from the standard Hardy-Weinberg proportions, because
there is an apparent excess of homozygotes. However, within any one
subpopulation, the genotypic frequencies still obey the Hardy-Weinberg
principle because of random mating. Substituting for $F_t$ in Equation 7.12
implies that the average heterozygosity among subpopulations at time t equals
$2p_0q_0(1 - F_t) = 2p_0q_0(1 - 1/2N)^t$; this is the theoretical curve plotted
in Figure 7.12A (with $p_0 = q_0 = 1/2$).
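
The genotype frequencies in the total population can be tabulated directly from these expressions. The sketch below assumes the base population starts with no autozygosity ($F_0 = 0$), so that $1 - F_t = (1 - 1/2N)^t$; the function names are hypothetical.

```python
def autozygosity(n: int, t: int) -> float:
    """F_t for an ideal population of size N, assuming F_0 = 0."""
    return 1.0 - (1.0 - 1.0 / (2.0 * n)) ** t


def total_population_genotypes(p0: float, f_t: float):
    """Frequencies of AA, Aa, and aa averaged over all subpopulations."""
    q0 = 1.0 - p0
    freq_AA = p0 ** 2 * (1.0 - f_t) + p0 * f_t
    freq_Aa = 2.0 * p0 * q0 * (1.0 - f_t)
    freq_aa = q0 ** 2 * (1.0 - f_t) + q0 * f_t
    return freq_AA, freq_Aa, freq_aa


# Theoretical values for N = 16 after 19 generations with p0 = q0 = 1/2.
f19 = autozygosity(16, 19)
AA, Aa, aa = total_population_genotypes(0.5, f19)
print(AA, Aa, aa)
```

The three frequencies sum to 1, and the heterozygote class shrinks by the factor $1 - F_t$ relative to the Hardy-Weinberg value $2p_0q_0$.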

Since $F_t$ eventually goes to 1, all subpopulations eventually become
fixed for one allele or the other. Because the average allele frequency of A
remains $p_0$ even when all subpopulations have become fixed, the proportion
of subpopulations that eventually become fixed for A must be $p_0$ (and the
proportion that eventually become fixed for a must be $q_0$). Stated another
way, the probability of ultimate fixation of an allele in any ideal
subpopulation is equal to the frequency of that allele in the initial
population. This point is illustrated by the actual example in Figure 7.4,
where $p_0 = 1/2$; by generation 19, a total of 58 populations have become
fixed, 30 for the bw allele and 28 for bw^75.


As we saw in the Drosophila experiments in Figure 7.12, populations generally
fluctuate in allele frequency by an amount greater than pq/2N. The reason
is that no real population obeys all the assumptions in Table 7.1 exactly. In
any actual case, there must be corrections for such complications as
fluctuations in population size, unequal numbers of males and females, age
structure, and skewed distributions in family size (see Crow and Kimura 1970).
The degree to which genetic drift can change allele frequencies, and the rates
of allele fixation by drift, can be approximated under these complicating
circumstances by calculating the effective size of the population and using this
value in the theory for an ideal population. That is, the effective population
size of an actual population is the number of individuals in a theoretically
ideal population having the same magnitude of random genetic drift as the
actual population. There are three kinds of effective population size based on
how we choose to measure "magnitude," namely: (1) the change in average
inbreeding coefficient, (2) the change in variance in allele frequency, or (3) the
rate of loss of heterozygosity. These are called the inbreeding effective size, the
variance effective size, and the eigenvalue effective size, respectively.

Wright (1931) first worked out the effective population size by considering
the effective degree of inbreeding in various situations. As noted, the effective
population size can also be calculated by determining the rate of change in
variance in a population, and Kimura and Crow (1963) first applied this
approach to the problem of overlapping generations. Usually, the inbreeding
effective size and the variance effective size are the same, but exceptions do
occur. Similarly, the variance effective size and the eigenvalue effective size
can be distinct (Ewens 1982). Some of the various factors that require
calculation of an effective population size will now be illustrated. We will
focus on the inbreeding effective size because this concept is the most widely
used.

Fluctuation in Population Size 

Correction for fluctuating population size is important because natural pop- 
ulations actually do change in size, sometimes by a factor of 10 or more in a 
single generation. For the sake of simplicity, assume that the population is 
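ideal in all respects except that its size is not constant. We will consider the
situation over just two generations. Suppose that the population sizes in two
successive generations are $N_0$ and $N_1$. The arguments laid out in Figure 7.10
imply that

$$1 - F_2 = \left(1 - \frac{1}{2N_1}\right)(1 - F_1)$$

$$1 - F_1 = \left(1 - \frac{1}{2N_0}\right)(1 - F_0)$$

Substituting from the second equation into the first leads to

$$1 - F_2 = \left(1 - \frac{1}{2N_0}\right)\left(1 - \frac{1}{2N_1}\right)(1 - F_0)$$

By analogy with the constant N case, it is appropriate to try to express this
equation in the general form

$$1 - F_t = \left(1 - \frac{1}{2N_e}\right)^t (1 - F_0)$$

where $N_e$ is now the effective population size. In our example t = 2, so

$$1 - F_2 = \left(1 - \frac{1}{2N_e}\right)^2 (1 - F_0)$$

Setting the two expressions for $1 - F_2$ equal to each other we obtain

$$\left(1 - \frac{1}{2N_e}\right)^2 = \left(1 - \frac{1}{2N_0}\right)\left(1 - \frac{1}{2N_1}\right)$$

from which $1/N_e = \tfrac{1}{2}(1/N_0 + 1/N_1)$ turns out to be an excellent
approximation. In general,

$$\frac{1}{N_e} = \frac{1}{t}\left(\frac{1}{N_0} + \frac{1}{N_1} + \cdots + \frac{1}{N_{t-1}}\right) \qquad (7.22)$$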

and so the effective size $N_e$ is the harmonic mean of the actual numbers,
that is, the reciprocal of the average of reciprocals. As illustrated in the
problem below, the harmonic mean tends to be dominated by the smallest terms. In
biological reality, this means that a single period of small population size,
called a bottleneck, can result in a serious loss in heterozygosity.
Population bottlenecks are thought to account for the very low levels of
polymorphism found in extant populations of the elephant seal (Bonnell and
Selander 1974) and the cheetah (O'Brien et al. 1985, 1987). A severe
population bottleneck often occurs in nature when a small group of emigrants
from an established subpopulation founds a new subpopulation; the
accompanying random genetic drift is then known as a founder effect (see
Holgate 1966; Nei et al. 1975; Chakraborty and Nei 1977; Neel and Thompson
1978). Founder effects in human populations have implications in medical
genetics, because human populations derived from small numbers of
founders may have an elevated incidence of an otherwise rare genetic
disorder. Examples include Tay-Sachs disease in Ashkenazi Jews, diastrophic
dysplasia in Finns, familial hyperchylomicronemia in Quebecois, and
congenital total color blindness in Pingelap Islanders. In addition to reducing
the effective size, and thereby increasing F, population bottlenecks and
founder effects may affect many other aspects of the genetic variation,
including causing a reduced number of alleles, a distorted distribution of
numbers of molecular site differences among alleles, and an increased level
of linkage disequilibrium.

PROBLEM 7.7 Suppose a population went through a bottleneck as
follows: $N_0$ = 1000, $N_1$ = 10, and $N_2$ = 1000. Calculate the effective size
of this population across all three generations.

ANSWER Using Equation 7.22, we get $1/N_e$ = (1/3)(1/1000 + 1/10
+ 1/1000) = 0.034, or $N_e$ = 1/0.034 = 29.4. The average effective number
over the three-generation period is only 29.4, whereas the arithmetic
average number of individuals is (1/3)(1000 + 10 + 1000) = 670.
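
The arithmetic in Problem 7.7 can be reproduced with a few lines of code. This is a minimal sketch of the harmonic-mean calculation only; the function name is illustrative.

```python
def effective_size_fluctuating(sizes):
    """Harmonic mean of census sizes: 1/N_e = (1/t) * sum(1/N_i)."""
    return len(sizes) / sum(1.0 / n for n in sizes)


bottleneck = [1000, 10, 1000]
n_e = effective_size_fluctuating(bottleneck)
arithmetic_mean = sum(bottleneck) / len(bottleneck)
print(n_e, arithmetic_mean)
```

Replacing the middle value by 1000 returns an effective size of 1000, which shows how strongly the single small generation dominates the result.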

Unequal Sex Ratio, Sex Chromosomes, Organelle Genes 

A second important case in which the effective size of a nonideal population 
can readily be calculated concerns sexual populations in which the number 
of males and females is unequal. This inequality creates a peculiar sort of 
"bottleneck"; because half of the alleles in any generation must come from 
each sex, any departure of the sex ratio from equality will enhance the oppor- 
tunity for random genetic drift. This situation is important in wildlife man- 
agement, where, for many game animals (pheasants and deer come 
immediately to mind), the legal bag limit for males is much larger than for 
females. Although some management goals are served by such hunting reg- 
ulations (for example, the species involved are usually polygamous, so one 
male can fertilize many females and overall actual population size can be 
maintained), it must be remembered that the resultant inequality in sex ratio 
reduces the effective population size. Specifically, if a sexual population
consists of $N_m$ males and $N_f$ females, the actual size is

$$N = N_m + N_f$$

However, the effective population size is

$$N_e = \frac{4N_mN_f}{N_m + N_f} \qquad (7.24)$$

Figure 7.13 shows the relationship between sex ratio and the reduction in 
effective population size. To take a realistic example, if hunting is permitted 
to a level at which the number of surviving males is one-tenth the number of 
females, then the effective population size is a mere one-third of the actual 
number of individuals in the population. 
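
The one-tenth example can be verified from Equation 7.24. The sketch below is illustrative; the function name and the particular counts (10 males, 100 females) are assumptions chosen to match the stated ratio.

```python
def effective_size_sex_ratio(n_males: float, n_females: float) -> float:
    """N_e = 4 * N_m * N_f / (N_m + N_f) for unequal numbers of the sexes."""
    return 4.0 * n_males * n_females / (n_males + n_females)


n_males, n_females = 10, 100          # males are one-tenth as numerous
n_e = effective_size_sex_ratio(n_males, n_females)
print(n_e / (n_males + n_females))    # fraction of the actual size, about 1/3
```

With equal numbers of the two sexes the formula returns the actual size, so the reduction is entirely due to the skew.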

A related problem is the effective population size for an X-linked gene. In
this case, the variance effective population size is

$$N_e = \frac{9N_mN_f}{4N_m + 2N_f} \qquad (7.25)$$

[Figure 7.13: horizontal axis, percent female.]

Figure 7.13 Effective size falls off rapidly in populations with a skewed sex
ratio.

Equation 7.25 can be justified by noting that the sampling variance for the
X chromosomes from males is $p_mq_m/N_m$, whereas the sampling variance for X
chromosomes from females is $p_fq_f/2N_f$, in which $p_m$ and $p_f$ are the
frequencies of allele A in males and females, respectively. The frequency of an
A-bearing X chromosome in the population is

$$p = \tfrac{1}{3}p_m + \tfrac{2}{3}p_f$$

and the sampling variance of p is

$$\mathrm{Var}(p) = \frac{1}{9}\,\frac{p_mq_m}{N_m} + \frac{4}{9}\,\frac{p_fq_f}{2N_f}$$

At steady state, $p_m = p_f = p$, so pq can be factored out, giving

$$\mathrm{Var}(p) = pq\left(\frac{1}{9}\,\frac{1}{N_m} + \frac{4}{9}\,\frac{1}{2N_f}\right) = \frac{pq}{2\left[\dfrac{9N_mN_f}{4N_m + 2N_f}\right]}$$

The term in the square brackets corresponds to the $N_e$ in Equation 7.25. It
shows why this is a variance effective size: the binomial sampling variance in
an ideal population is $pq/2[N_e]$.
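
The algebra behind Equation 7.25 can be confirmed numerically by comparing the weighted sampling variance with $pq/2N_e$. This is a small sketch under the steady-state assumption $p_m = p_f = p$; the function names are hypothetical.

```python
def x_linked_effective_size(n_males: float, n_females: float) -> float:
    """Variance effective size for an X-linked gene (Equation 7.25)."""
    return 9.0 * n_males * n_females / (4.0 * n_males + 2.0 * n_females)


def x_linked_sampling_variance(p: float, n_males: float, n_females: float) -> float:
    """Var(p) with weights 1/3 (males) and 2/3 (females), assuming p_m = p_f = p."""
    q = 1.0 - p
    male_term = (1.0 / 9.0) * (p * q / n_males)
    female_term = (4.0 / 9.0) * (p * q / (2.0 * n_females))
    return male_term + female_term


p, n_m, n_f = 0.3, 40, 60
direct = x_linked_sampling_variance(p, n_m, n_f)
via_ne = p * (1.0 - p) / (2.0 * x_linked_effective_size(n_m, n_f))
print(direct, via_ne)   # the two values should agree
```

With equal numbers of males and females the effective size comes out to three-quarters of the total number of individuals, reflecting the three X chromosomes carried by each male-female pair.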


PROBLEM 7.8 What is the effective population size for mitochondrial
DNA? (Assume transmission is exclusively from mothers to all
offspring.) What is the effective population size for a gene on the Y
chromosome, given that the population consists of N diploid individuals
and all other assumptions of Table 7.1 apply? (Assume XX individuals
are female and XY individuals are male.)

ANSWER Mitochondrial DNA is transmitted essentially exclusively
by females. The chance of drawing two mtDNAs that are identical by
descent is $1/N_f$, where $N_f$ is the number of females in the population.
Hence the effective size is simply $N_f$. Similarly, the effective population
size for the Y chromosome is $N_m$, the number of males in the population.
Note that even though mtDNA is present in all individuals,
while the Y is present only in males, the effective size of mtDNA is not
larger. Effective size depends on the sampling properties of a gene,
which depend on the gene's transmission, not just on how many
individuals carry the gene.


There are many forces in population genetics that act in opposition to one
another, and it is this tension that makes for interesting behavior at the
population level. Mutation always increases the amount of genetic variation in a
population. Random genetic drift results in the loss of genetic variation.
The mere fact that these two forces are in opposition does not guarantee that
there will be a stable balance between them. In order to formally ask whether
the two forces do balance, we need to be careful to specify assumptions about
the processes of mutation and drift. We already examined one such model in
Chapter 5, the infinite-alleles model, and we saw that in this case the forces
do in fact balance to provide an equilibrium level of neutral variation. Let's
consider this model once again, in somewhat more detail.


As we saw in Chapter 5, the infinite-alleles model starts with the assumption
that each mutation produces a novel allele, never before present in the
population. Mutations occur such that each gene in the population has an equal,
but low, chance of mutating. Random genetic drift occurs in the manner of
the Wright-Fisher model: each generation the population is reconstituted by
drawing a sample with replacement from the current sample of alleles.
Under these assumptions we saw that the equilibrium probability of identity,
F, could be approximated as

$$F = \frac{1}{4N\mu + 1}$$

The number of selectively neutral alleles increases under mutation pressure
until F satisfies this equation.

PROBLEM 7.9 Derive an expression for F in a finite population with
mutation and migration.

ANSWER First assume that there is no mutation, and that new
migrant alleles arrive from another population at a rate m per generation.
As in the balance between mutation and drift in the infinite-alleles
model, we note that alleles can be identical by descent by being
drawn twice (with probability 1/2N) or by having two different alleles
drawn but having them be IBD from the previous generation. The
equilibrium autozygosity can be written

$$F = \left[\frac{1}{2N} + \left(1 - \frac{1}{2N}\right)F\right](1 - m)^2$$

because $(1 - m)^2$ is the probability that neither of the two randomly
chosen alleles comes from a migrant. By analogy with the infinite-alleles
model, we get (in the case of migration with no mutation)

$$F = \frac{1}{4Nm + 1}$$

When both migration and mutation are occurring, alleles are identical
by descent only if they neither mutated nor migrated, and for a pair of
alleles this occurs with probability approximately $(1 - m - \mu)^2$. Thus,
the equilibrium autozygosity is

$$F = \frac{1}{4N(m + \mu) + 1}$$


F is the probability of autozygosity, and H, which can be thought of as
heterozygosity, is also the probability of drawing a pair of alleles that are not
autozygous. Since H = 1 - F, under the infinite-alleles model, the equilibrium
of H is

$$H = \frac{4N\mu}{1 + 4N\mu} = \frac{\theta}{\theta + 1}$$

where $\theta = 4N\mu$.

The relationship between the quantity $4N\mu$ and H was encountered in
Chapter 5 and is plotted in Figure 5.7. For a per-locus mutation rate of
$10^{-6}$ and a population size of 250,000, we get $\theta = 1$, and so H = 1/2.
Note that increases in population size have precisely the same effect as
increases in mutation rate. Heterozygosity approaches one only if population
sizes are very large (such as in microbial organisms) or if mutation rates are
very high (such as at some microsatellite loci). Next we will consider how one
might go about testing whether a sample from a population exhibits a pattern of
genetic variation that is compatible with the infinite-alleles model.
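
The numerical example above follows directly from $H = \theta/(\theta + 1)$. The sketch below simply evaluates that expression; the function names are illustrative.

```python
def theta(n: float, mu: float) -> float:
    """Scaled mutation rate, theta = 4 * N * mu."""
    return 4.0 * n * mu


def equilibrium_heterozygosity(th: float) -> float:
    """Steady-state H = theta / (theta + 1) under the infinite-alleles model."""
    return th / (th + 1.0)


th = theta(250_000, 1e-6)
print(th, equilibrium_heterozygosity(th))

# Doubling N and doubling mu have the same effect, because only the
# product N * mu enters the formula.
print(equilibrium_heterozygosity(theta(500_000, 1e-6)))
print(equilibrium_heterozygosity(theta(250_000, 2e-6)))
```

Because H depends on N and the mutation rate only through their product, the last two lines print the same value.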

The Ewens Sampling Formula 

The infinite-alleles model has an "equilibrium" when $H = 4N\mu/(1 + 4N\mu)$.
This is not an equilibrium in the usual sense. In reality, allele frequencies are
always changing, new mutations continue to come into the population, and
eventually they are eliminated, even perhaps after becoming fixed for some
time. The term steady state is probably more appropriate for this kind of
behavior, since the alleles are not maintained at a constant frequency, but
rather new ones are entering and old ones are leaving the population. The
population remains at a steady state in the sense that the number of alleles,
and the level of autozygosity, remain stationary. If the number of alleles and
the level of autozygosity remain steady, then it is reasonable to assume that
there is also a steady-state distribution of allele frequencies. By steady-state
frequencies we mean that the most common allele always has a frequency of
$p_1$, and the next most common has a frequency of $p_2$, and so on. The
steady-state distribution has the curious property that, even though the most
common allele is expected to have a frequency of $p_1$, the identity of the most
common allele is expected to change with time. In the steady-state
population, not all alleles are equally frequent, and F is greater than it would
be were all alleles equally frequent.

Consider the steady-state distribution of allele frequencies from the point
of view of an experimenter taking a sample from a population. Let the
sample size be n genes, and suppose there are k different alleles in this sample.
The sample might consist of, for example, 10 unique alleles, 3 alleles that are
represented twice in the sample, 7 alleles that are present 3 times, and so on.
Such a description of the sample is called the allelic configuration or
partition. A remarkable finding of Ewens was that the expected configuration of a
sample drawn from a population obeying the infinite-alleles model is
entirely determined by the sample size, n, and the number of observed alleles, k.
Ewens showed that the expected number of alleles in the sample, given $\theta$ and
the sample size, is

$$E(k) = 1 + \frac{\theta}{\theta + 1} + \frac{\theta}{\theta + 2} + \cdots + \frac{\theta}{\theta + n - 1} \qquad (7.31)$$

If $\theta$ is very small, $E(k) \approx 1$, whereas for very large $\theta$, E(k)
approaches n, implying that for a large enough population with a high enough
mutation rate, every allele that is sampled will be different. The form of
Equation 7.31 suggests that, as the sample size increases, more alleles will be
found, but that there is a diminishing return in finding new alleles when the
sample size increases. When E(k) is plotted against $\theta$ (Figure 7.14), the
increase in the expected number of alleles is greatest for larger sample sizes
when the population is highly diverse (large $\theta$).
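
Equation 7.31 is easy to evaluate, which makes the limiting behavior and the diminishing return concrete. The following is a minimal sketch; the function name is illustrative, and $\theta$ is assumed to be strictly positive.

```python
def expected_number_of_alleles(th: float, n: int) -> float:
    """E(k) = sum over i = 0 .. n-1 of theta / (theta + i), for theta > 0.

    The i = 0 term equals 1, matching the leading 1 in Equation 7.31.
    """
    return sum(th / (th + i) for i in range(n))


for n in (50, 100, 250):
    print(n, expected_number_of_alleles(1.0, n))

print(expected_number_of_alleles(1e-9, 100))   # close to 1
print(expected_number_of_alleles(1e9, 100))    # close to n = 100
```

For $\theta = 1$ the sum is the harmonic series, so going from 50 to 100 genes adds less than one expected allele.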

The infinite-alleles model gives a steady-state prediction of F given $\theta$
(because $F = 1/(1 + \theta)$ from Equation 7.30), and a prediction of k from
Equation 7.31. Combining these predictions, the expected relation between F and
k is plotted in Figure 7.15. The hyperbolic relation is not surprising, because a
population with many alleles will generally have a lower probability of
identity of a randomly chosen pair of alleles. For $\theta = 1$, the expected F is
1/2 for all sample sizes, but a larger sample size should yield a greater number
of distinct alleles.

Figure 7.14 Relations between $\theta$, the expected number of alleles, and the
sample size according to the Ewens-Watterson sampling theory of a population in
steady state under the infinite-alleles model of neutral mutation.

[Figure 7.15: horizontal axis, expected number of alleles, E(k); curve labels recovered from the plot are $\theta$ = 0.1 and $\theta$ = 10, with sample sizes n = 50, n = 100, and n = 250.]

Figure 7.15 The infinite-alleles model prediction of the relation between the
expected number of alleles and the expected gene identity F. The three curves
represent a range of values of $\theta = 4N\mu$, starting at $\theta = 0.1$ in the
upper left, and ending with $\theta = 10$ in the lower right. For the value of
$\theta = 1$, the expected F, given by the relation $F = 1/(1 + \theta)$, is 1/2,
regardless of the sample size. Larger sample sizes always lead to larger
expected numbers of alleles, but the difference is greater in more diverse
populations (those with smaller F).

The Ewens-Watterson Test 

The Ewens sampling theory expressed in Equation 7.31 shows that the sam- 
ple size and the number of distinct alleles observed in the sample are suffi- 
cient to give an expected configuration of allele counts. From the observed 
and expected configurations, a number of test statistics can be devised to 
determine whether the observed sample fits the expected values of the 
model. Figure 7.16 shows histograms of the observed and expected allele fre- 
quency configurations for alleles in a human population defined by a VNTR 
polymorphism. In this particular example, there appears to be a slight excess 
of the common allele, which is consistent with any number of causes of
departure from the infinite-alleles model.

Keith et al. (1985) isolated 89 homozygous lines from a sample of
Drosophila pseudoobscura collected at the Gundlach-Bundschu Winery in
Sonoma Valley, California. Homogenized tissue from these 89 lines was then
subjected to sequential electrophoresis (a sensitive means of detecting charge
and conformation differences among the protein products), and stained to
reveal differences in xanthine dehydrogenase (Xdh) mobility. They obtained
a common allele that was present in 52 of the lines, one allele that was
present in nine lines, one allele that was present in eight lines, two alleles
present in four lines each, two alleles that were present in two lines each, and
eight singleton or unique alleles.

[Figure 7.16: histogram by allele rank, ordered from more common to less common.]

Figure 7.16 Observed (open columns) and expected (black bars) allele
frequency distribution of the HRAS-1 locus in humans, identified by Southern
blotting with the pLMO.8 probe and TaqI digests. Observed data are from Baird
et al. (1986), and the expected distribution was generated using Ewens'
sampling theory. In this sample of 490 genes there were 14 distinct alleles, four
of which were present in just one individual. (From Clark 1988.)

To test whether the observed configuration fits the expectation, a computer
simulation was run to generate realizations of samples from populations
that obey the infinite-alleles model, having the same number of alleles and
sample size as the observed data. The algorithm to do this simulation is
described by F. Stewart in the Appendix to Fuerst et al. (1977), and a listing of
a program can be found in Manly (1985). From each computer-generated sample,
F is calculated as the sum of the squared allele frequencies. Figure 7.17
shows a histogram of the computer-generated distribution of F, along with
an arrow showing where the Drosophila sample fell. The sample had an
observed F that fell in the upper tail of the distribution, and since so few
values of F from the null hypothesis were larger than the observed F, Keith et al.
rejected the null hypothesis and argued that the data did not fit the
infinite-alleles model satisfactorily. The departure was in the direction of excess
homozygosity (deficit of H), but since the populations were probably in
Hardy-Weinberg proportions, a clearer way to state the result would be to say
that there was a deficit of genetic diversity for the given number of observed
alleles. The deficit means that the common allele is more common than
expected, and there are also more singletons than expected. This pattern of
frequencies is consistent with purifying selection acting to eliminate the rare,
slightly deleterious alleles that continually enter the population by mutation. It
is also consistent with a historical effect in which many alleles may have been
previously lost and the population has not yet had time to return to equilibrium.

[Figure 7.17: histogram of simulated F values, with an arrow marking the observed F = 0.3657.]

Figure 7.17 Computer-generated distribution of F obtained from 1000 samples
from a population obeying the assumptions of the infinite-alleles model
with k = 15 alleles and a sample of size n = 89 (as in the Xdh data from a sample
of Drosophila pseudoobscura from the Gundlach-Bundschu Winery studied by
Keith et al. 1985). The mean of F from the simulation was 0.168, which is well
below the observed F of 0.366. A significant departure of the observed F from
the predictions of the model is noted by the small area under the tail of the
distribution to the right of the arrow.

The results of the Ewens-Watterson test can also be reported graphically
as in Figure 7.18. Each gene yields a point specified by the number of distinct
alleles and the observed F. The two curves represent the 95% confidence
interval generated by the Ewens sampling theory. A quick check of the
concordance of the data with the model can be made by seeing whether points
remain in this confidence region. Although Xdh in Drosophila pseudoobscura
provides a dramatic departure from the infinite-alleles model, results like
those plotted in Figure 7.18, which show an acceptable fit to neutrality, are
more commonly obtained.
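
The observed F for the Xdh sample can be recomputed from the allele counts reported by Keith et al. (1985). The sketch below covers only this observed statistic, not the simulation of the null distribution; the function and variable names are illustrative.

```python
def gene_identity(counts):
    """F as the sum of squared allele frequencies in a sample."""
    n = sum(counts)
    return sum((c / n) ** 2 for c in counts)


# Xdh configuration: one allele in 52 lines, one in 9, one in 8,
# two in 4 lines each, two in 2 lines each, and eight singletons.
xdh_counts = [52, 9, 8, 4, 4, 2, 2] + [1] * 8

print(len(xdh_counts), sum(xdh_counts))   # number of alleles k and sample size n
print(gene_identity(xdh_counts))          # observed F
```

The counts give k = 15 alleles in n = 89 lines, and the resulting F agrees with the value marked on the simulated distribution.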


[Figure 7.18: vertical axis, F; horizontal axis, number of alleles (k); individual loci are plotted as labeled points.]

Figure 7.18 Gene identity (F) plotted against the observed number of alleles
in a sample of 279 E. coli. The solid lines represent the upper 97.5% and lower
2.5% confidence limits, and the observation that all of the tested loci fall within
these limits suggests good concordance with the infinite-alleles model of neutral
mutation. (From Whittam et al. 1983.)

Rather than considering each mutation as generating a unique allele, with
infinitely many possible alleles, we can instead consider an allele as a
sequence of nucleotides with mutation altering a site in the sequence. If the
mutation rate is sufficiently low, then most sites will be monomorphic, and
all polymorphic sites will be segregating for just two nucleotides. Much of
the available data on allelic variation in DNA sequence seems consistent with
this view: few nucleotide sites are segregating for more than two nucleotides.
If the DNA sequence is sufficiently long and the frequency of polymorphic
sites low, then most of the time new mutations will occur at sites that were
previously monomorphic. The infinite-sites model, based on these assumptions,
was developed by Kimura (1969, 1971), who considered nucleotides as
unlinked, and by Watterson (1975), who took account of the nearly complete
linkage among sites.

The infinite-sites model is appealing because it directly addresses the type
of data that molecular population geneticists can collect. Given an array of
DNA sequences of alleles randomly sampled from a population, there is
considerable information about the history of the alleles hidden in the patterns
of similarity across alleles. The infinite-alleles model ignores this pattern and
simply considers the alleles as distinct. A much more powerful treatment is
to tabulate the number of sites at which all pairwise combinations of
sequences differ, resulting in a so-called mismatch distribution. The
infinite-sites model addresses the theoretically expected behavior of the mismatch
distribution. Watterson (1975) considered the distribution of $S_n$, defined as
the number of segregating sites in a sample of n genes. For the case of a random
sample of two genes, Watterson showed that the steady-state probability that
the sequences have i mismatches is

$$\Pr(S_2 = i) = \frac{1}{\theta + 1}\left(\frac{\theta}{\theta + 1}\right)^i \qquad (7.32)$$

where $\theta = 4N\mu$, and $\mu$ is the mutation rate per gene (not per site). A
particular case of this equation gives the probability that two sequences have no
sites different, and hence are identical. Substituting i = 0 into Equation 7.32,
we get

$$\Pr(S_2 = 0) = \frac{1}{\theta + 1} \qquad (7.33)$$

in agreement with the infinite-alleles model, because $\Pr(S_2 = 0) = F$, the
probability that two alleles drawn at random are identical. The mean and variance
in the distribution of number of segregating sites are $\theta$ and
$\theta + \theta^2$, respectively.
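
Equation 7.32 is a geometric distribution, so its mean and variance can be checked by direct summation. The sketch below truncates the infinite sum at a large cutoff, which is an assumption of the illustration; the function name is hypothetical.

```python
def prob_mismatches(i: int, th: float) -> float:
    """Pr(S_2 = i) = (1 / (theta + 1)) * (theta / (theta + 1)) ** i."""
    return (1.0 / (th + 1.0)) * (th / (th + 1.0)) ** i


th = 2.0
cutoff = 2000   # terms beyond this point are negligibly small for theta = 2
probs = [prob_mismatches(i, th) for i in range(cutoff)]

total = sum(probs)
mean = sum(i * p for i, p in enumerate(probs))
variance = sum((i - mean) ** 2 * p for i, p in enumerate(probs))
print(total, mean, variance)   # compare with 1, theta, and theta + theta**2
```

Setting i = 0 recovers $1/(\theta + 1)$, the same steady-state F as in the infinite-alleles model.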

In reality we do not sample an entire population, so it is important to
determine the statistical properties of a smaller sample drawn from a population.
Often the sampling properties of population genetic models are very complex
and we have to resort to simulations for meaningful estimates. A few results
have been obtained for samples drawn from a population obeying the
infinite-sites model, and these results are very useful for testing goodness of fit
to the model. The expected number of segregating sites in a sample of n alleles is

$$E(S) = \theta \sum_{i=1}^{n-1} \frac{1}{i}$$

and the variance in the number of segregating sites is

$$V(S) = \theta \sum_{i=1}^{n-1} \frac{1}{i} + \theta^2 \sum_{i=1}^{n-1} \frac{1}{i^2}$$

This expression for the variance is for the case of no intragenic
recombination. It turns out that intragenic recombination does not affect E(S),
but it reduces V(S). This is not hard to see intuitively: recombination shuffles
the variation among alleles, reducing the average number of sites by which
random pairs of alleles differ. The expression for the variance in the number of
mismatching sites in the case of free recombination across sites is

$$V(S) = \frac{n + 1}{3(n - 1)}\,\theta + \frac{2(n^2 + n + 3)}{9n(n - 1)}\,\theta^2$$

[Figure 7.19: horizontal axis, number of segregating sites.]

Figure 7.19 Equilibrium distribution of the number of mismatches between a
pair of alleles. Note that if there is free recombination, the variance is smaller
compared to the case of no recombination.

Figure 7.19 shows the mismatch distributions for a simulated set of data
with free recombination (smaller variance) and with no recombination (larger
variance). The relationship between the mean and the variance in the
mismatch distribution can be used to make inferences about intragenic
recombination (Hudson 1987).

The assumptions of the infinite-alleles model and the infinite-sites model
do not seem to be entirely at odds with one another, and we saw that they
predict the same steady-state value for F. But the two models do make use of
different aspects of the data, and so it would seem that a test of the
consistency between the two models might serve as a useful test for the neutral
theory. The next problem makes use of just this test, which was devised by
Tajima (1989).

PROBLEM 7.10 The average heterozygosity for pairs of randomly
chosen alleles under the infinite-alleles model is $E(k) = \theta$, and the
expected number of sites segregating in a sample (under the infinite-sites
model) is

$$E(S) = \theta \sum_{i=1}^{n-1} \frac{1}{i}$$

Two estimates of $\theta$ are therefore k, the average heterozygosity, and

$$\frac{S}{\sum_{i=1}^{n-1} 1/i}$$

Tajima (1989) devised a test statistic to test the null hypothesis that
these two estimates were identical. The test statistic is the difference
between these two estimates of $\theta$, or

$$D = k - \frac{S}{\sum_{i=1}^{n-1} 1/i}$$

If a population were growing rapidly, one might expect this to affect
both the number of segregating sites and the heterozygosity. Predict
the direction of change of F (probability of identity), S (the number of
segregating sites), and D (Tajima's test statistic).

ANSWER First consider a larger population at equilibrium. Since
$F = 1/(4N\mu + 1)$, a larger population would have a lower F. A larger
population would also have a larger number of segregating sites (S),
and a higher per-site heterozygosity (k). At equilibrium, if the gene is
neutral, then the Tajima statistic should be zero. In a growing
population, F will decrease as added variation accumulates, S will
increase, and k will increase. The key point is that the increase in
variation will occur in initially rare alleles, which contribute to S but only
a little to k. Thus, S grows faster than k, and D will be negative. If the
population stops growing, then Tajima's D statistic will return to
zero at equilibrium.
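
The two estimates of $\theta$ compared in Problem 7.10 can be computed from a set of aligned sequences. The sketch below implements only the unnormalized difference described in the problem (in practice the statistic is usually divided by an estimate of its standard deviation, which is omitted here); the function names and the toy alignment are invented for illustration.

```python
from itertools import combinations


def mean_pairwise_differences(seqs):
    """Average number of differing sites over all pairs of sequences (k)."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs)


def segregating_sites(seqs):
    """Number of alignment columns with more than one nucleotide (S)."""
    return sum(len(set(column)) > 1 for column in zip(*seqs))


def tajima_difference(seqs):
    """k - S / sum(1/i for i = 1 .. n-1): the difference of the two estimates."""
    n = len(seqs)
    a1 = sum(1.0 / i for i in range(1, n))
    return mean_pairwise_differences(seqs) - segregating_sites(seqs) / a1


toy_alignment = ["AAAA", "AAAT", "AATT", "ATTT"]   # hypothetical data
print(mean_pairwise_differences(toy_alignment))
print(segregating_sites(toy_alignment))
print(tajima_difference(toy_alignment))
```

A sample dominated by rare variants, such as many singletons, raises S more than k and pushes the difference below zero, which is the pattern the answer predicts for a growing population.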


A sample of genes from a population represents more than a snapshot of
counts of alleles in a population. Each gene that is sampled has an ancestral
history dating back hundreds or thousands of generations. It is possible that
a pair of genes sampled today may have come from identical copies of the
same allele produced by the same individual just a few generations ago. Or
the alleles may have had common ancestry hundreds of generations ago. The
term coalescence refers to this process, looking backward in time, and seeing
how two genes merge at times of common ancestry. Along this process, one
goes from a sample of k genes, to k - 1 ancestors after the first coalescence, to
k - 2 ancestors after the second coalescence, and so forth until there is a single
common ancestor for the whole sample. The idea of the coalescent is to
consider the ancestral history of genes in a sample by developing a model for the
time to common ancestry (Kingman 1980).
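
The backward-in-time view can be illustrated with a small simulation. The sketch below assumes Wright-Fisher sampling, in which each sampled gene picks its parent uniformly at random from the gene copies of the previous generation; the function name, the population size, and the random seed are all choices made for this example.

```python
import random


def ancestors_back_in_time(sample_size: int, n_gene_copies: int, rng: random.Random):
    """Number of distinct ancestral lineages at each generation into the past.

    Each lineage chooses a parent uniformly among n_gene_copies; lineages
    that choose the same parent coalesce. Stops at a single common ancestor.
    """
    lineages = sample_size
    history = [lineages]
    while lineages > 1:
        parents = {rng.randrange(n_gene_copies) for _ in range(lineages)}
        lineages = len(parents)
        history.append(lineages)
    return history


rng = random.Random(1)
history = ancestors_back_in_time(sample_size=10, n_gene_copies=40, rng=rng)
print(history)        # lineage counts, starting with the present-day sample
print(len(history) - 1)   # generations back to the common ancestor
```

The count of ancestral lineages never increases going backward. In a small population such as this one, several lineages can merge in the same generation, whereas in a large population the merges tend to occur one pair at a time, as in the k, k - 1, k - 2 description above.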

To understand how the coalescent process works, consider in Figure 7.20 
what happens as time moves forward. In each generation there are a number 
of alleles in the population, and those alleles may be reproduced and be pre- 
sent in the following generation (moving down the figure), or, in some cases, 
an allele is not reproduced and is lost from the population. By chance, some 
alleles may be sampled twice in constituting the next generation, and the 
probabilities of these events are the same as those under the Wright-Fisher 
model of random genetic drift. By a repetition of this process over time,
eventually one of the original alleles will become "fixed" in the population. In the
absence of mutation, the population would therefore be fixed for the same
allele; however, because mutation may occur during the process, the alleles
observed at the present will not all be identical in nucleotide sequence, even
though they all descended from a single common ancestral allele.

Figure 7.20 Diagram showing paths of ancestry of a set of alleles sampled at
the present. The population is represented as having a constant size. Starting at
the top and working down, notice that many alleles go extinct, and one allele
goes to fixation. Considering this process in reverse, the current sample
observed at present undergoes a series of coalescence events in which the k
alleles present in the current generation had only k - 1 ancestors. This process
continues backward in time until there is only one ancestral allele.

In reality we do not have the genealogical information enabling us to fol- 
low all the alleles through time in a population. Typically what we have is a 
single "snapshot" represented by a small sample of alleles taken at the pre- 
sent time. Now consider Figure 7.20 again, but this time look at what hap- 
pens when we go backwards in time. We start with the k alleles in the sample 
at generation 0. In going from generation 0 to generation 1 (one generation
ago), we see that the two rightmost alleles "coalesced" into a single ancestral 
allele. As we go further back in time, the number of ancestral alleles has to 
either remain the same or decrease, and each reduction in the number of 
ancestral alleles is called a coalescence event. In order to show how this idea
can be extended to derive expressions for the entire distribution of branch 
lengths of a gene tree, we next specify a model. 

Consider two alleles. The probability that the two alleles came from the
same allele in the previous generation is 1/2N (in a diploid population of size
N), so the chance that they came from two distinct alleles the previous
generation is 1 - 1/2N. The probability that three alleles had three distinct
ancestral alleles the previous generation is Pr(alleles 1 and 2 have distinct
ancestors) × Pr(allele 3 is different from both 1 and 2) = (1 - 1/2N)(1 - 2/2N). In
general, the probability that k alleles had k distinct parental alleles the
previous generation is

$$\Pr(k) = \prod_{i=1}^{k-1}\left(1 - \frac{i}{2N}\right) \approx 1 - \binom{k}{2}\frac{1}{2N} \tag{7.37}$$

Each generation the sampling process occurs independently of what
happened before, and so the probability that k alleles had k distinct parental
alleles two generations ago is the square of the right-hand side of Equation 7.37.
Consider two alleles again. Suppose we wish to know the chance that the
common ancestor of these two alleles occurred exactly t generations ago. In
this case there must have been no coalescence (i.e., two distinct ancestral
lineages were found) for t - 1 generations, and then, in the next preceding
generation, a coalescence occurred. The probability of not coalescing for t - 1
generations is $(1 - 1/2N)^{t-1}$, and the chance of the two alleles coalescing in any
one generation is 1/2N. The desired probability is the product of these, or

$$\Pr(\text{2 alleles had common ancestor } t \text{ generations ago}) = \frac{1}{2N}\left(1 - \frac{1}{2N}\right)^{t-1} \approx \frac{1}{2N}\,e^{-t/2N} \tag{7.38}$$

The exponential is an approximation that is quite good when 1/2N is small.
This distribution has a mean of 2N generations and a variance of $4N^2$. Note
that the confidence interval around the mean time is not very tight, since the
standard deviation of the distribution is equal to the mean.
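The two-allele case can be checked numerically. The following is a minimal sketch, assuming the geometric form of Equation 7.38; the function names and the value N = 500 are illustrative choices, not part of the text.

```python
import math

def pairwise_coalescence_pmf(t, N):
    """Exact probability that two alleles coalesce exactly t generations ago."""
    p = 1.0 / (2 * N)                 # per-generation coalescence probability
    return p * (1.0 - p) ** (t - 1)   # no coalescence for t - 1 generations, then one

def pairwise_coalescence_pmf_approx(t, N):
    """Exponential approximation, good when 1/2N is small."""
    p = 1.0 / (2 * N)
    return p * math.exp(-p * t)

def geometric_mean_and_variance(N):
    """Mean and variance of the exact (geometric) waiting time."""
    p = 1.0 / (2 * N)
    return 1.0 / p, (1.0 - p) / p ** 2

N = 500  # illustrative diploid population size
mean, var = geometric_mean_and_variance(N)
print("mean:", mean, "variance:", var, "sd:", math.sqrt(var))
print("exact:", pairwise_coalescence_pmf(100, N))
print("approx:", pairwise_coalescence_pmf_approx(100, N))
```

For N = 500 the mean is 2N = 1000 generations and the standard deviation is very nearly the same, which is why a single pairwise coalescence time is so imprecise.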

Returning to our sample of k alleles, the probability that the k alleles do
not coalesce for t generations, then one pair coalesces to give k - 1 alleles at
t + 1 generations ago is as follows:

$$\Pr(k \text{ ancestors for } t \text{ generations},\ k-1 \text{ ancestors at } t+1 \text{ generations ago}) = \Pr(k)^{t}\,[1 - \Pr(k)] \approx \frac{\binom{k}{2}}{2N}\exp\left[-\frac{\binom{k}{2}}{2N}\,t\right] \tag{7.39}$$

This approximation is valid if k << N. The distribution in Equation 7.39 has
a mean of 4N/[k(k - 1)] generations and a variance of $16N^2/[k(k - 1)]^2$. Figure
7.21 shows what the gene genealogy is expected to be. Starting with five
alleles, the first coalescence is expected to occur 2N/10 generations ago, the next
at 2N/6 generations prior to that, and so on. Note that the time intervals get
longer and longer as the number of lineages decreases. The distribution of
each of these time intervals is exponential, with ever-increasing means as one
goes back in time. The time to the coalescence of all of the k alleles (i.e., the
most recent time that the sample of k alleles shared a common ancestor) has mean

$$\bar{t} = 4N\left(1 - \frac{1}{k}\right) \tag{7.40}$$

with variance

$$V = 4N^2 \sum_{i=2}^{k} \binom{i}{2}^{-2} \tag{7.41}$$

(Kingman 1982; Tajima 1983). As the sample size k increases toward the total
population size, $\bar{t}$ approaches 4N, which equals the expected fixation time for
a newly arisen mutation. These principles allow us to generate simulated
gene genealogies whose branch lengths correspond to the assumptions of the
Wright-Fisher model. One thing the model still lacks is mutation, which is
introduced in the next discussion.

Figure 7.21 The process of coalescence can be represented by a gene tree. At
each generation, if there are k alleles present, the expected time back to the next
coalescence is $2N/\binom{k}{2}$. Starting with five alleles, the expected time back to the
first coalescence is 2N/10. Note that the successive times get longer. When there
are only two alleles, the time back to the final coalescence is 2N generations.
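The expected interval lengths and their sum can be tabulated exactly. This is a small sketch, assuming the interval means quoted above (4N/[i(i - 1)] generations while i lineages remain) and independent exponential intervals; the function names and the choice N = 1000 are illustrative.

```python
from fractions import Fraction
from math import comb

def expected_interval(i, N):
    """Expected generations during which the sample has i lineages."""
    return Fraction(4 * N, i * (i - 1))

def expected_tmrca(k, N):
    """Expected time back to the common ancestor of all k alleles."""
    return sum(expected_interval(i, N) for i in range(2, k + 1))

def variance_tmrca(k, N):
    """Variance of that time: each interval is exponential, so variance = mean squared."""
    return sum(expected_interval(i, N) ** 2 for i in range(2, k + 1))

N, k = 1000, 5  # illustrative values
for i in range(k, 1, -1):
    print(i, "lineages:", expected_interval(i, N), "generations")
print("sum of intervals:", expected_tmrca(k, N))
print("4N(1 - 1/k):     ", 4 * N * (1 - Fraction(1, k)))
print("variance:", float(variance_tmrca(k, N)))
```

Using exact fractions makes it easy to see that the interval means telescope to 4N(1 - 1/k), so most of the expected depth of the tree comes from the final interval with two lineages.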

Coalescent Models with Mutation 

In order to generate simulated gene sequence data representing samples
drawn from a population obeying the infinite-sites model, Hudson (1990,
1993) showed that one can proceed as follows:

• Determine the sample size k and the θ for the gene region of interest;

• Draw random numbers with appropriate exponential distributions to
construct a gene genealogy such that times of coalescence follow
Equation 7.39;

• On each branch of this tree, distribute mutations with a Poisson
distribution, such that the mean number of mutations on each branch is
given by 2Nμt, where t is the branch length expressed in units of 2N
generations.

This procedure has been widely used in generating data sets under the
neutral hypothesis for comparison to observed data sets.
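The procedure above can be sketched in a few lines. This is a simplified illustration rather than Hudson's own program: it records only the number of segregating sites per replicate (not full sequences), measures time in units of 2N generations so that the mutation count on the whole tree is Poisson with mean θ/2 times the total branch length, and uses hypothetical function names throughout.

```python
import random

def poisson_draw(rng, mean):
    """Draw a Poisson variate by counting unit-rate arrivals that occur before `mean`."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def simulate_segregating_sites(k, theta, rng):
    """One replicate: segregating sites in a sample of k alleles (infinite sites)."""
    total_length = 0.0  # total branch length, in units of 2N generations
    for i in range(k, 1, -1):
        rate = i * (i - 1) / 2.0            # coalescence rate while i lineages remain
        total_length += i * rng.expovariate(rate)
    return poisson_draw(rng, theta / 2.0 * total_length)

def mean_segregating_sites(k, theta, replicates, seed=0):
    """Average number of segregating sites over many simulated genealogies."""
    rng = random.Random(seed)
    draws = [simulate_segregating_sites(k, theta, rng) for _ in range(replicates)]
    return sum(draws) / replicates

def expected_segregating_sites(k, theta):
    """Infinite-sites expectation: theta times the sum of 1/i for i = 1 .. k - 1."""
    return theta * sum(1.0 / i for i in range(1, k))

print("simulated:", mean_segregating_sites(10, 5.0, 2000, seed=1))
print("expected: ", expected_segregating_sites(10, 5.0))
```

Repeating the simulation many times gives a null distribution of S (or of any other statistic computed from the replicates) against which an observed sample can be compared.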

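From Figure 7.21, it follows that the sum of the branch lengths for the
entire gene tree is

$$T = \sum_{i=2}^{k} i\,T_i \tag{7.42}$$

where $T_i$ is the time during which the sample has i ancestral lineages. The
expected number of segregating sites in the whole sample is 2NμT,
where T is the sum of the branch lengths measured in units of 2N
generations, so that $E(T_i) = 2/[i(i-1)]$. Substituting, we get

$$E(S) = 2N\mu\,E(T) = \frac{\theta}{2}\sum_{i=2}^{k} i\,E(T_i) = \theta\sum_{i=1}^{k-1}\frac{1}{i} \tag{7.43}$$

The rightmost expression agrees with Equation 7.34, which we derived for
the infinite-sites model.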

The coalescent approach can be used to derive many fundamental
principles in population genetics. As one example, consider a population presently
in mutation-drift equilibrium. In the previous generation, a pair of alleles can
either coalesce, with probability 1/2N, or failing to coalesce, one or the other
allele may mutate with probability 2μ. (The factor 2 comes in because either
copy can mutate.) These are the only two events that affect identity, and the
sum of their probabilities is 1/2N + 2μ. The probability of identity is therefore
the fraction of the time that the alleles coalesce:

$$F = \frac{\dfrac{1}{2N}}{\dfrac{1}{2N} + 2\mu} = \frac{1}{1 + 4N\mu} \tag{7.44}$$

We have already derived this equilibrium identity under the infinite-sites 
(and infinite-alleles) models. Coalescence methods are not limited to the con- 
sideration of the Wright-Fisher model. If one can develop a recursion equa- 
tion for probabilities of recombination, migration, or other such phenomena 
in a gene tree context, then often powerful insights can be derived from coa- 
lescence approaches. For our purposes, suffice it to say that the method can 
generate classical results, often with much less difficulty, and the coalescence 
approach is especially well suited to testing hypotheses about samples drawn 
from populations. 
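The competing-events argument behind Equation 7.44 can be written out directly. A minimal sketch follows; the function name and the numerical values of N and μ are illustrative assumptions.

```python
def identity_probability(N, mu):
    """Probability that coalescence happens before either lineage mutates."""
    p_coalesce = 1.0 / (2 * N)   # pair coalesces in a given generation
    p_mutate = 2.0 * mu          # either of the two copies mutates
    return p_coalesce / (p_coalesce + p_mutate)

N, mu = 10_000, 1e-5  # illustrative values, giving 4N*mu = 0.4
F = identity_probability(N, mu)
print("F from competing events:", F)
print("1 / (1 + 4N*mu):        ", 1.0 / (1.0 + 4 * N * mu))
print("heterozygosity 1 - F:   ", 1.0 - F)
```

Both expressions give the same value, which is simply the algebraic identity in Equation 7.44 evaluated at one point.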

PROBLEM The probability distribution for the number of
generations back to the first coalescence (in a pure drift model) in a sample
of k genes taken from a haploid population of size N is approximately:

$$\Pr(\text{first coalescence } t \text{ generations ago}) = x e^{-xt} \quad \text{where } x = \binom{k}{2}\Big/N$$

From this one can show that the mean number of generations back to
the first coalescence is 1/x. The more genes in the sample, the more
likely it will be that a coalescence occurred recently. Calculate the
expected time to first coalescence in a population of N = 450 for a
sample of 10 genes. How many genes would you have to sample to
halve this coalescence time?


ANSWER The expected time to first coalescence in a population of
N = 450 for a sample of 10 genes is

$$N\Big/\binom{k}{2} = 450\Big/\binom{10}{2} = 450/(10 \times 9/2) = 10 \text{ generations.}$$

To determine how many genes one would have to sample to halve
this coalescence time, solve for k in

$$5 = 450\Big/\binom{k}{2}$$

This is equivalent to 90 = k!/[2!(k - 2)!]. By trial and error, you will
find that a sample of 14 genes will do it. Note that by increasing the
sample only from 10 to 14, we expect to find a pair of alleles half as
divergent from each other.
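The trial-and-error step in this answer is easy to automate. The sketch below assumes the haploid mean 1/x = N divided by the number of pairs, as stated in the problem; the function names are illustrative.

```python
from math import comb

def expected_first_coalescence(k, N):
    """Mean generations back to the first coalescence among k genes (haploid N)."""
    return N / comb(k, 2)

def smallest_sample_for(target, N):
    """Smallest sample size whose expected first-coalescence time is at most `target`."""
    k = 2
    while expected_first_coalescence(k, N) > target:
        k += 1
    return k

print(expected_first_coalescence(10, 450))   # 450 / 45
print(smallest_sample_for(5, 450))           # first k with 450 / C(k, 2) <= 5
print(expected_first_coalescence(13, 450), expected_first_coalescence(14, 450))
```

A sample of 13 still gives an expected time above 5 generations, whereas 14 brings it just below, matching the answer found by trial and error.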


Gene frequencies fluctuate at random in finite populations. The rate at which
allele frequencies change varies inversely with population size. The reason
for the inverse relationship is that the sampling variance, when two alleles
are segregating in a population, is determined by the binomial sampling
process, and the binomial variance is pq/2N. The Wright-Fisher model
extended the idea of binomial sampling over multiple generations, and much
of our understanding of drift has been derived from this model. In a
population in which the only force acting on gene frequencies is random drift, all
variation must ultimately be lost. The Wright-Fisher model shows why the
probability that an allele will drift to fixation is equal to its initial frequency
in the population. The diffusion approximation of the Wright-Fisher model
is a second-order partial differential equation that yields the distribution
φ(x, t) giving the number of populations with allele frequency x at time t.
The diffusion approach has yielded important insights into the consequences
of drift, including the expected time to fixation and loss of alleles. The
expected time to fixation of a newly introduced allele is 4N generations, showing
once again that drift happens faster in smaller populations.

A useful way to think about random drift is to consider a set of
subpopulations of the same size undergoing repeated generations of sampling and
drift. Within each of these subpopulations, genotypes are composed by
drawing alleles at random, so that each subpopulation is always in
Hardy-Weinberg equilibrium. The hypothetical population composed by pooling
the subpopulations will have a deficit of heterozygotes because, as allele
frequencies drift closer to fixation, the frequency of heterozygotes declines.
Heterozygosity in a finite population declines by a factor of (1 - 1/2N) each
generation, so that a population of size 10, say, loses 5% of its heterozygosity
each generation. The allele frequencies of the subpopulations are equally likely
to drift up as down, so the average allele frequency over subpopulations shows
no change.

Real biological populations do not precisely fit the Wright-Fisher model.
They generally exhibit changes in allele frequency that exceed the amount
expected based on the actual population size. The usual reason for the
discrepancy is that the drift process occurs as though there are fewer than the
observed census number of individuals. The models give better
correspondence to reality by calculating the effective population size. Several different
factors that require consideration in calculating effective size were examined in
this chapter, including unequal sex ratio, fluctuation in population size over
generations, and the uniparental transmission of mtDNA and Y chromosomes.
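The per-generation loss of heterozygosity mentioned in this summary compounds geometrically. The following sketch assumes only the factor (1 - 1/2N) per generation; the function names and example values are illustrative.

```python
import math

def heterozygosity_after(h0, N, t):
    """Expected heterozygosity after t generations of drift in a population of size N."""
    return h0 * (1.0 - 1.0 / (2 * N)) ** t

def generations_to_fraction(N, fraction):
    """Generations of drift needed for heterozygosity to fall to `fraction` of its start."""
    return math.log(fraction) / math.log(1.0 - 1.0 / (2 * N))

N = 10  # the population size used as an example in the text
print("after one generation:", heterozygosity_after(1.0, N, 1))   # 5% lost
print("generations to halve:", generations_to_fraction(N, 0.5))
```

With an effective size substituted for N, the same two functions describe how quickly real populations are expected to lose heterozygosity through drift alone.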
Mutation introduces variation into populations, and random genetic drift
erodes that variation. These two forces come to a steady state predicted by
population genetic models. The infinite-alleles model assumes that each new
mutation generates a novel allele. The steady-state balance between mutation
and drift in the infinite-alleles model is given by the autozygosity, F, which is
also the probability that two alleles are identical by descent. Under the
infinite-alleles model for a diploid population, F = 1/(1 + θ), where θ = 4Nμ.
Note that the mutation rate and population size are confounded in this
model: increasing either one will decrease the autozygosity by the same
amount. We can write the same equation in terms of heterozygosity H = 1 - F,
giving H = θ/(1 + θ). The infinite-sites model is related to the infinite-alleles
model, but more specifically states that novel mutations occur at a site along
the gene that has not mutated before. (If this is true, each new mutation must
also generate a novel allele.) The infinite-sites model generates predictions
about the number of segregating sites expected in a population at steady
state. Here the result is that the expected number of segregating sites in a
sample of k alleles is $E(S) = \theta\sum_{i=1}^{k-1} 1/i$, where the mutation rate in this
value of θ is the mutation rate over the entire gene in question.

Classical models of random genetic drift look forward in time, following
alleles as they are lost from the population and generated anew by mutation.
More recently, the coalescent approach has been to look backward in time,
starting with the observed sample of alleles, and calculating times to
common ancestry of alleles. Coalescent approaches are particularly appropriate
when one wants to consider the probability that a particular observed set of
molecular sequence data might have the characteristics expected of random
genetic drift. Computer generation of gene trees using principles of
coalescence theory makes it easy to produce a null distribution giving the full range
of outcomes expected under a drift model.



1. Suppose that in one generation, in a population of size 50, the average
heterozygosity (averaged across loci) is reduced from 0.50 to 0.42. Is the
population mating at random?

2. In how many generations will the expected heterozygosity be 5% of the
initial value in a diploid randomly mating population of size of 10? Size


3. A gene in one individual in a population of 24 barn cats undergoes muta- 
tion to a new neutral allele. What is the probability that the allele eventu- 
ally becomes fixed? What is the probability that it eventually becomes 
lost? What are the answers if the mutant gene is X-linked and the
population consists of equal numbers of males and females?

4. If an isolated population of annual alpine plants decreases in heterozy- 
gosity by half every 50 years because of random genetic drift, what is its 
effective population size?

5. Remote Pitcairn Island in the South Pacific was settled in 1789 by Fletch- 
er Christian and eight fellow mutineers from HMS Bounty, along with a 
small number of Polynesian women. Although many descendants have 
left the island in the intervening years, there has been essentially no 
immigration. Assuming an effective size of 20 in each of the eight gener- 
ations since the island's settlement, what value of F ST would be expected 
in today's population from random genetic drift? 

6. In a population of effective size N = 50, how long is required for random
genetic drift to double the value of the fixation index F from 0.01 to 0.02?
From 0.05 to 0.10? Assuming that F is small, how many generations are
required to double the value of F in a population of effective size N? For
the latter, use the approximations that $[1 - (1/2N)]^t \approx \exp(-t/2N)$ and
that, when F is small (F < 0.10), ln(1 - F) ≈ -F.

7. What is the effective population number in a population of large preda- 
tory cats in which each breeding male controls a harem of five females 
and the total population consists of 200 males and 200 females? 

8. What is the effective population size of a herd of ten dairy cows and one 
bull? What is it for 40 cows and one bull? For 10 cows and two bulls? 

9. What is the variance effective population size for an X-linked gene in a 
population consisting of 100 females and 10 males? In a population of 10 
females and 100 males? 

10. Among 100 restriction site differences in two inbred strains of the flour
beetle Tribolium that are crossed and allowed thereafter to mate at
random, what number of restriction sites would be expected to remain
segregating after 10 generations assuming an effective population size of 80
individuals? How many would be expected to remain unfixed after 50
generations?

11. In a haploid population of constant effective size 50, what is the
probability that two randomly drawn alleles shared a common ancestor
exactly 100 generations ago?

12. Employing the infinite-sites model, if θ = 10, how many segregating sites
will one expect to find in a sample of size 10? 20? 50? 

13. Consider an isolated island population with no migration, effective pop- 
ulation size of 250,000, and a mutation rate of 10"*. Calculate the expect- 
ed heterozygosity under the infinite-alleles model How much migration 
is necessary to increase H to 2 /, 7 

14. In a haploid population of effective size 50, how large a sample must one 
take to yield an expected mean coalescence time of 10 generations? 

15. Show that random genetic drift requires an average of t = 2N \nx genera- 
tions to reduce the heterozygosity from H to H /x. 

16. Use Equation 7.15 to show that approximately 2N generations of random
genetic drift are required to reduce the number of segregating genes by a 
factor of e (e = 2.71828 . . .), given initial allele frequencies close to 0.5. 

17. A set of six to eight oocytes from each of three women undergoing in 
vitro fertilization (IVF) were recently tested for heteroplasmy (presence 
of more than one mitochondrial DNA type within each cell). The mtDNA 
from eggs of two women were all identical and matched that of somatic 
cells with no heteroplasmy, but the other woman produced eggs with two 
different mtDNA types. Densitometry scans allowed investigators to 
determine that the individual cells had relative frequencies of the two 
mtDNA types ranging from 20% to 50%. Assuming 30 cell generations 
from zygote to zygote in the maternal germline, and N = 1000 mitochon- 
dria per cell, what do you conclude from these observations? Are they 
consistent with neutral sampling of mtDNA types? 


Molecular Population Genetics 

Molecular Clock - Synonymous and Nonsynonymous Substitution 
Codon Bias Gene Genealogies Organelle DNA 

Molecular Phylogenetics 

ALL THE forces in population genetics have an impact on the
pattern of variation seen in molecular sequences of genes,
including mutation, migration, selection, and random drift. A primary
focus of molecular population genetics is to make inferences about the
contribution of each of these evolutionary forces to produce the patterns of
molecular sequence variation we see today. Usually this process involves a close
interplay between mathematical model building, statistical parameter
estimation, and experimental observation. Several times in the past, unexpected
patterns of sequence variation have arisen which, in turn, gave rise to whole
new avenues of theoretical inquiry. In many cases, inferences about
evolutionary forces transcend species boundaries by making use of data on both
within-species polymorphism and between-species divergence. The genetic
basis for species isolation is itself amenable to analysis. But first let us begin
with the basic theoretical principles that underlie molecular population
genetics.


The first systematic application of protein electrophoretic methods to popu- 
lation genetics revealed extensive genetic variation within most natural pop- 
ulations. Typically, 15 to 50% of the genes coding for enzymes were observed 
to include two or more widespread, polymorphic alleles. The polymorphic 
alleles occurred with frequencies considered to be too high to result from
equilibrium between adverse selection and mutation. Motoo Kimura
suggested most polymorphisms observed at the molecular level are
selectively neutral, so that their frequency dynamics in a population are
determined by random genetic drift (Kimura 1968). By extension, the hypothesis
of selective neutrality would also apply to most nucleotide or amino acid
substitutions that occur within a molecule during the course of evolution.

The neutral theory has been of great importance in population genetics in
stimulating the collection and analysis of data in attempts to evaluate its
adequacy. Mathematical investigations of its implications have resulted in one of
the most complete and elegant theories in all of biology. Tests of the
correspondence of sample data to the neutral theory are almost universally low in
power, which means that large sets of data are needed before one has a
reasonable chance of rejecting neutrality. The recent trend has been that more
and more cases of departures from neutrality are being found, in part
because of the expansion in available data and in part because of the
increasing subtlety of tests that are applied. Regardless of the action of other forces
shaping molecular sequence variation in populations, the force of random
drift is always there, and for this reason the neutral theory remains useful in
generating rigorous null hypotheses. The next section summarizes some of
the theoretical implications of the neutral theory and some of the data
bearing on it.

Theoretical Principles of the Neutral Theory 

The neutral theory models the fate of mutations that are so nearly
selectively neutral in their effects that their fate is determined largely through
random genetic drift. A variety of mutation models have been considered,
including infinite-alleles, infinite-sites, and finite-sites models. In all models,
though, random drift occurs when N adult individuals produce an infinite
pool of gametes from which 2N are chosen at random to create the N zygotes
of the next generation. Much of the complexity of the mathematics of the
neutral theory arises from the fact that the mutational histories of alleles are
not independent, because they share an overlapping genealogical history.
Before we get into the details of the predictions of the neutral theory, let us
first review some of the theory's principal implications (Kimura 1983).

1. If a population contains a neutral allele with allele frequency $p_0$, then the
probability that the allele eventually becomes fixed equals $p_0$. In
particular, a newly arising neutral mutation occurs in just one copy, so the initial
allele frequency is $p_0 = 1/2N$, and the probability of eventual fixation of
the mutation is therefore 1/2N. Figure 8.1 shows that a mutant allele
arising in a smaller population has a higher chance of fixation.

2. The steady-state rate at which neutral mutations are fixed in a population
equals μ, where μ is the neutral mutation rate. It is noteworthy that the
equilibrium rate of fixation does not involve the population size N. The
reason is that the N cancels out: The overall rate is determined by the
product of the probability of fixation of new neutral mutations (1/2N)
and the average number of new neutral mutations in each generation
(2Nμ), hence (1/2N) × (2Nμ) = μ.

3. The average time that occurs between consecutive neutral substitutions
equals 1/μ. This principle follows directly from the one above. If the
steady-state rate of fixation is μ per unit time, the average length of time
between substitutions will be the reciprocal, or 1/μ. By way of analogy, if
a Swiss clock cuckoos at the rate of 24 times per day, then the average
length of time between cuckoos is 1/24th of a day, or one hour. As Figure
8.1 shows, the time interval between fixations is independent of
population size, and elevating the mutation rate decreases the time interval
between fixations.

Figure 8.1 Diagram showing the trajectory of neutral alleles in a population.
Alleles enter the population by mutation and have an initial allele
frequency of 1/2N. Most alleles are lost, but those that go to fixation take an average of
4N generations. The time between successive fixations of neutral alleles is 1/μ
generations. (A) A moderate size population. (B) The same population size; a
higher mutation rate gives the same time to fixation, but less time between
fixations. (C) A smaller population has alleles that go to fixation more rapidly, but
the time between fixations is still 1/μ. (After Kimura 1980.)

PROBLEM 8.1 The neutral theory makes a strong prediction about
the relationship between population size and heterozygosity. Under
the infinite-alleles model, we can express the prediction by the
formula H = 4Nμ/(4Nμ + 1), and hence small populations should have
low heterozygosity and large populations high heterozygosity. Do the
data support this prediction? A survey of 77 species reviewed by Nei
and Graur (1984) found that species with very small populations (less
than, say, $10^4$) had a mean protein heterozygosity of 0.05, whereas
those species with a very large population (greater than $10^9$, say,
which include Drosophila species), have heterozygosities of around
0.2. This positive correlation seems to favor the neutral theory, except
that the range of H is much smaller than theory predicts in view of the
enormous range in N. When these extremes of population size are
excluded (N < $10^4$ and N > $10^9$), there is no significant correlation
between population size and heterozygosity. What is going on?

ANSWER The paradoxical result demonstrates that levels of
variability in a population are determined by several forces, and that
different organisms may be affected by the forces to different
magnitudes. The result does not support the neutral theory insofar as
it shows that population size does not, by itself, explain levels of
variation. On the other hand, the discrepancy is not grounds to
completely toss out the neutral theory. For one thing, the population sizes were
generally roughly estimated, and effective sizes (Chapter 7), which
were not estimated, are more relevant to neutral predictions of
heterozygosity. There is also an implicit assumption that mutation rates
are identical in all organisms, and violations of this assumption can
be found.

4. Analysis of the diffusion equation has shown that, among newly arising
neutral alleles that are destined to be fixed, the average time to fixation is
$4N_e$ generations (where $N_e$ is the effective population size). This too is
evident in Figure 8.1: alleles that go to fixation do so in less time in the
smaller population. Among newly arising neutral alleles destined to be
lost, the average time to loss is $(2N_e/N)\ln(2N)$ generations. The average
times required for fixation or loss apply to newly arising alleles, which
are necessarily present in just one copy, so $p_0 = 1/2N$. The implication of
these formulas is that, on average, neutral mutations that are going to be
fixed require a very long time for this to occur, but mutations destined to
be lost are lost quite rapidly.

5. If each neutral mutation creates an allele that is different from all others
existing in the population in which it occurs, then, at equilibrium, when
the average number of new alleles gained through mutation is exactly
offset by the average number lost through random genetic drift, the expected
homozygosity equals $1/(4N_e\mu + 1)$, where μ is the neutral mutation rate.
The model of mutation in which each new allele is novel is the
infinite-alleles model of mutation. The quantity $4N_e\mu$, which shows up frequently
in the neutral theory, is often denoted as θ. The equilibrium average
homozygosity is therefore 1/(1 + θ). Since the heterozygosity equals one
minus the homozygosity, the average heterozygosity at equilibrium in the
infinite-alleles model equals θ/(1 + θ). Larger populations are expected to
have a higher heterozygosity, as reflected in the greater number of alleles
segregating at any one time in the larger populations in Figure 8.2.
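The five implications listed above reduce to a handful of formulas that can be evaluated together. The sketch below is illustrative only: the function name, the dictionary keys, and the parameter values are assumptions introduced here, and N and the effective size are passed separately because the formulas in the text distinguish them.

```python
import math

def neutral_summary(N, Ne, mu):
    """Neutral-theory quantities for census size N, effective size Ne, mutation rate mu."""
    theta = 4 * Ne * mu
    return {
        "fixation_probability": 1 / (2 * N),              # new mutation, one copy
        "substitution_rate": (1 / (2 * N)) * (2 * N * mu),  # N cancels, leaving mu
        "time_between_substitutions": 1 / mu,
        "mean_time_to_fixation": 4 * Ne,
        "mean_time_to_loss": (2 * Ne / N) * math.log(2 * N),
        "equilibrium_heterozygosity": theta / (1 + theta),
    }

# Illustrative values: census and effective size equal, mu = 1e-6 per generation.
for name, value in neutral_summary(N=1000, Ne=1000, mu=1e-6).items():
    print(f"{name}: {value:.6g}")
```

With these values a mutation destined for fixation takes about 4000 generations on average, whereas one destined for loss disappears in roughly 15, which illustrates the asymmetry described in item 4.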


Figure 8.2 Given the enormous variation in effective population sizes, one
would expect to see a wider range in variation in heterozygosity than is actually
observed. The relation between population size and heterozygosity does not fit
the neutral theory expectation over a wide range of intermediate population
sizes. Horizontal axis: logarithm of population size. (After Nei and Graur 1984.)



Rates of Amino Acid Replacement 

The initial impetus for the neutral theory came from observations on the rate
of amino acid replacements in proteins. When extrapolated to the entire
genome, the inferred rate of evolution was several nucleotide substitutions
per year. This rate was regarded as much too high to result from natural
selection, because the intensity of selection must be limited by the total
amount of differential survival and reproduction that occurs in the
organism. Direct DNA sequencing later revealed that rates of nucleotide
substitution vary according to the function (or presumed absence of function) of the
nucleotides. The type of data that must be analyzed are best illustrated by
example. The first 18 amino acids present at the amino terminal end of the
human and mouse γ-interferon proteins constitute a signal peptide that is
used in secretion of the molecules (Gray and Goeddel 1983). The sequences are

Human  Met Lys Tyr Thr Ser Tyr Ile Leu Ala Phe Gln Leu Cys Ile Val Leu Gly Ser
Mouse  Met Asn Ala Thr His Cys Ile Leu Ala Leu Gln Leu Phe Leu Met Ala Val Ser

In order to calculate the proportion of amino acids that differ in the two 
signal sequences, we can simply count the number of sites that are the same 
and the number of sites that differ. Among the 18 amino acids there are 10 
differences, so the proportion different is 10/ 18 = 0.56. 

To interpret these data, let us suppose that amino acid replacements occur
at the rate λ per unit time. Consider two independently evolving sequences,
initially identical, which at time t are found to differ in the proportion $D_t$ of
their amino acids. After the next time interval, the proportion of differences
$D_{t+1}$ is given by

$$D_{t+1} = (1 - D_t)(2\lambda) + D_t \tag{8.1}$$

In this equation, $(1 - D_t)(2\lambda)$ is the proportion of sites, previously
identical, in which one or the other underwent an amino acid replacement during
the time interval in question, which must be added to the already existing
differences $D_t$ in order to give the total. (The equation ignores the unlikely
possibility of an amino acid replacement making two previously different
amino acid sites identical.) The factor of 2 is present because the total time for
evolution is 2t units (t units in each lineage after the split), which is
illustrated in Figure 8.3. Equation 8.1 suggests the differential equation

$$\frac{dD}{dt} = D_{t+1} - D_t = 2\lambda - 2\lambda D_t \tag{8.2}$$

which has the solution

$$D_t = 1 - e^{-2\lambda t} \tag{8.3}$$
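The step from the recursion in Equation 8.1 to its continuous-time solution can be checked numerically. This is a minimal sketch, assuming the two sequences start out identical; the function names and the values of λ and t are illustrative.

```python
import math

def divergence_by_recursion(lam, t):
    """Iterate D_{t+1} = (1 - D_t)(2*lam) + D_t for t steps, starting from D_0 = 0."""
    d = 0.0
    for _ in range(t):
        d = (1.0 - d) * (2.0 * lam) + d
    return d

def divergence_closed_form(lam, t):
    """Solution of the differential equation: D_t = 1 - exp(-2*lam*t)."""
    return 1.0 - math.exp(-2.0 * lam * t)

lam, t = 1e-3, 500  # illustrative rate per time unit and number of time units
print("recursion:  ", divergence_by_recursion(lam, t))
print("closed form:", divergence_closed_form(lam, t))
```

The two values agree closely as long as λ is small per time step, which is the condition under which the difference equation is well approximated by the differential equation.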


Figure 8.3 Two amino acid or nucleotide sequences that have each undergone
independent evolution from a common ancestor for t time units are separated
by a total time of 2t units because there are t units in each lineage after the split.
The proportion of sites that differ in the sequence is denoted D and the total
number of sites L. In this particular example, L = 10 and D = 3/10.

An alternative argument can be used to derive Equation 8.3 without
resorting to differential equations. If $\lambda$ is the rate of amino acid
replacement per unit time, then the probability that a particular site remains
unsubstituted for $t$ consecutive intervals along each of two independent
lineages is $(1 - \lambda)^{2t}$, which is approximately equal to
$e^{-2\lambda t}$, provided that $\lambda t$ is not too large. Thus, the
probability $D_t$ of one or more replacements occurring in $t$ units of time
after divergence is approximately $1 - e^{-2\lambda t}$, which is Equation 8.3.

Since $\lambda$ is the rate of amino acid replacement per unit time, the
expected number of replacements per site separating two sequences at any
time $t$ is

$$K = 2\lambda t \tag{8.4}$$

where the factor of 2 is again present because the total time for evolution is
$2t$ units (Figure 8.3).

Substituting K from (8.4) into (8.3) and rearranging yields the following
estimate of K, denoted $\hat{K}$:

$$\hat{K} = -\ln(1 - D) \tag{8.5}$$

where D is the observed proportion of sites in which two sequences differ. If
the sequences under comparison are L amino acids in length, then the
variance Var($\hat{K}$) of $\hat{K}$ is estimated from the sampling
distribution of D implied by the substitution process and is approximately

$$\mathrm{Var}(\hat{K}) = \frac{D}{(1 - D)L} \tag{8.6}$$


322 Chapter 8 

The rate of evolution at the molecular level is given by the amount of
sequence divergence that occurs per unit of time. Thus, as suggested by
Equations 8.4 and 8.5, if two sequences are compared, and these are known to
have diverged from a common ancestral sequence an estimated $t$ time units
ago, then the rate of evolution $\lambda$ may be estimated as

$$\hat{\lambda} = \hat{K}/2t \tag{8.7}$$

The units of $\lambda$ are usually expressed as replacements per amino acid
site (or substitutions per nucleotide site) per year.

The quantity K is used in preference to D in estimating the rate of molec- 
ular evolution because K takes multiple substitutions into account. Over long 
periods of evolutionary time, the amino acid present at a particular site may 
be replaced several times, first by one alternative, then by another, then still 
another, and perhaps, at some stage, even return to the amino acid originally 
present at the site. When comparing two sequences, only the sites that are dif- 
ferent can be identified. Sites that are identical at the present time may 
include some that were different in the past, and sites that are different at the 
present time might have undergone more than one substitution. The quanti- 
ty D is determined only by the proportion of differences between the 
sequences observed at the present time. The estimate K makes a correction
for multiple substitutions, but at the cost of introducing assumptions that the
substitutions occur independently and at the same rate through time.

For relatively short intervals of evolutionary time, during which multiple
substitutions remain uncommon, the correction is minor, and the value of K
is close to that of D. This can be seen by the fact that the initial slope of the
curve plotted in Figure 8.4 is 1. As the observed sequence divergence
increases, it becomes more likely that multiple hits have happened, so the slope
decreases. Over longer intervals, when many multiple substitutions have
occurred, the correction is important, and the assumptions on which it is
based must be evaluated critically. Correction for multiple substitution events
is even more important for nucleotides than it is for amino acids. With amino
acids, the probability of a random replacement returning an amino acid site
to its original identity is 1/19 (assuming equal frequencies), whereas for
nucleotides it is 1/3.
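The size of the correction can be made explicit with a series expansion. The lines below are a short worked sketch that follows directly from K = -ln(1 - D); nothing beyond that relation is assumed.

```latex
K = -\ln(1 - D) = D + \frac{D^{2}}{2} + \frac{D^{3}}{3} + \cdots
\qquad\Longleftrightarrow\qquad
D = 1 - e^{-K} = K - \frac{K^{2}}{2} + \cdots

\frac{dD}{dK} = e^{-K} = 1 - D
```

At K = 0 the slope dD/dK equals 1, so K and D nearly coincide for small divergence (for D = 0.10, K is about 0.105). The slope falls toward 0 as D approaches 1, which is the saturation of D described above.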

PROBLEM 8.2 Use the data in the preceding example to estimate
the average rate of amino acid replacement in the signal peptide of
γ-interferon during the divergence of mice and humans. Based on
fossil evidence, the separation of these species occurred approximately
80 million years ago.

Figure 8.4 As sequences become more divergent over time, the number of
substitutions per site (K) can continue to increase, but the proportion of sites
that mismatch in the observed sequences (D) saturates.

ANSWER For the signal peptide, D = 0.56 and K = -ln(1 - 0.56) =
0.82. The estimated rate of evolution is therefore
$0.82/[2 \times (80 \times 10^{6})] = 5.1 \times 10^{-9}$ amino acid
replacements per amino acid site per year. The standard deviation of K is
estimated as equal to $[0.56/(0.44 \times 18)]^{1/2} = 0.27$. With such a
small sample size, the estimates could ordinarily not be taken too literally.
However, in this case, the average rate for the signal sequence is very close
to the average rate for the molecule as a whole. For γ-interferon, among 155
amino acid sites there are 91 differences, giving K = 0.88 ± 0.22 and an
average rate of $5.5 \times 10^{-9}$ amino acid
replacements per amino acid site per year.
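The arithmetic of this answer can be reproduced with a short Python sketch. The function names are illustrative choices; the formulas are K = -ln(1 - D), Var(K) = D/[(1 - D)L], and rate = K/2t, as given in the text.

```python
import math


def poisson_corrected_distance(D):
    """K = -ln(1 - D): replacements per site, corrected for multiple hits."""
    return -math.log(1.0 - D)


def distance_sd(D, L):
    """Square root of Var(K) = D / [(1 - D) L]."""
    return math.sqrt(D / ((1.0 - D) * L))


def rate_per_year(K, t_years):
    """Rate = K / (2t): the two lineages together span 2t years."""
    return K / (2.0 * t_years)


t = 80e6  # mouse-human divergence time used in the problem, in years

# Signal peptide: 10 differences among 18 sites
D_signal = 10 / 18
K_signal = poisson_corrected_distance(D_signal)
print(round(K_signal, 2), round(distance_sd(D_signal, 18), 2))
print(rate_per_year(K_signal, t))

# Whole molecule: 91 differences among 155 sites
K_whole = poisson_corrected_distance(91 / 155)
print(round(K_whole, 2), rate_per_year(K_whole, t))
```

With the unrounded D = 10/18 the sketch gives K of about 0.81 rather than the 0.82 obtained from D = 0.56; the estimated rate still rounds to 5.1 x 10^-9 per site per year.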

Rates of amino acid replacement vary over a 500-fold range in different
proteins. The rate of amino acid replacement in γ-interferon is one of the
fastest rates known (Li et al. 1985). Among the slowest rates is that of histone
H4, for which $\lambda = 0.01 \times 10^{-9}$ per year. The average rate among
a large number of proteins is very close to the rate found in hemoglobin,
which is approximately $1 \times 10^{-9}$ amino acid replacements per amino
acid site per year.

To be concrete about the interpretation of the rate of amino acid
replacement, consider a protein exactly 100 amino acids in length, in which the
rate of amino acid replacement per amino acid site equals
$1.0 \times 10^{-9}$ per year. For the entire protein, the rate of replacement
equals $100 \times 1.0 \times 10^{-9} = 1 \times 10^{-7}$ per year. In two
different species, therefore, the protein would accumulate
amino acid differences at the rate of one replacement every 5 million years
since their divergence from a common ancestor [because
$(5 \times 10^{6}) \times 2 \times (1 \times 10^{-7}) = 1.0$].

The simple model that we just examined makes an assumption that is
violated by an abundance of data. We assumed that all amino acid
replacements occur with equal likelihood. Besides the fact that real proteins violate
this assumption, we might not have expected it to be true, since some amino 
acid changes require a single underlying nucleotide change, while others 
require two or even three changes. More sophisticated models for amino acid 
sequence evolution account for these differences by weighting amino acid 
changes with their observed rates of change (Dayhoff 1972; Jones et al. 1992). 

Rates of Nucleotide Substitution 

Nucleotide sequences are analyzed in the same manner as amino acid
sequences, but the analogous equation to (8.1) is slightly more complicated
because it has to correct for cases in which a substitution makes two
previously different nucleotide sites identical. The correction is significant for
nucleotide sequences because an expected one third of random substitutions
will make two previously different nucleotides identical. The correction is
usually unnecessary for proteins because only 1/19 of random replacements
make two previously different amino acids identical.

Several models of nucleotide substitution have been studied, which differ
primarily in the assumptions about rates of mutation between pairs of
nucleotides. The simplest model is one in which mutation occurs at a
constant rate, and each nucleotide is equally likely to mutate to any other (Jukes
and Cantor 1969). If $\alpha$ is the rate of mutating from one nucleotide to a
different nucleotide, then in any time interval, A mutates to C with probability
$\alpha$, A mutates to T with probability $\alpha$, and A mutates to G with
probability $\alpha$. The probability that A does not mutate in this interval is
therefore $1 - 3\alpha$. The probability that a particular site is A at time
$t + 1$ is

$$P_{A(t+1)} = (1 - 3\alpha)P_{A(t)} + \alpha(1 - P_{A(t)}) \tag{8.8}$$

because the first part of the equation gives the probability of having been A
at time $t$ and not mutating, and the second part is the probability of being
any other nucleotide and mutating to A. From 8.8 it follows that

$$P_{A(t+1)} - P_{A(t)} = \frac{dP_{A(t)}}{dt} = -4\alpha P_{A(t)} + \alpha \tag{8.9}$$

Solving this differential equation,

$$P_{A(t)} = \tfrac{1}{4} + \tfrac{3}{4} e^{-4\alpha t} \tag{8.10}$$



assuming that the initial state was A. This is the transition probability from
A to A, which we can write as $P_{AA}$. If we observe two sequences that have
been separated for time $t$, then the probability that they continue to carry the
same nucleotide at a particular site is

$$P_{AA} = \tfrac{1}{4} + \tfrac{3}{4} e^{-8\alpha t} \tag{8.11}$$

because $2t$ is the total duration of time along both lineages during which
changes could occur. Let $d$ be the proportion of nucleotide sites that differ
between two sequences:

$$d = 1 - P_{AA} \tag{8.12}$$

$$d = \tfrac{3}{4}\left(1 - e^{-8\alpha t}\right) \tag{8.13}$$



In the previous symbols, $\lambda$ is the rate of mutation to a nucleotide
different from the current nucleotide, so relating this to $\alpha$, we have
$\lambda = 3\alpha$. This implies that
$k = 2\lambda t = 2(3\alpha t) = 6\alpha t$. Rearranging Equation 8.13 and
taking logarithms of both sides, we deduce

$$8\alpha t = -\ln\left(1 - \tfrac{4}{3} d\right)$$

and, since $k = \tfrac{3}{4}(8\alpha t)$,

$$\hat{k} = -\tfrac{3}{4} \ln\left(1 - \tfrac{4}{3} d\right)$$

where $k$ is the expected number of nucleotide substitutions per site
separating two sequences at a time $t$ units after their evolutionary
separation. By analogy with protein evolution, $d$ is the observed proportion
of $L$ nucleotide sites in which the sequences differ. The variance
Var($\hat{k}$) of the estimate can be estimated as

$$\mathrm{Var}(\hat{k}) = \frac{d(1 - d)}{L\left(1 - \tfrac{4}{3} d\right)^{2}}$$



Figure 8.5 shows the relationship between time and d, and shows that
nucleotide sequences that follow the Jukes-Cantor pattern of mutation (all
nucleotides equally interchangeable) approach an asymptote, showing a
divergence of 3/4. This makes intuitive sense because, after sufficient time, the
common ancestry of the sequences has been erased, and 1/4 of the sites will
match by chance.
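The saturation at 3/4 can be illustrated with a small simulation. The following is a rough sketch only: the rate `alpha`, the number of steps, and the seed are arbitrary assumptions, and the helper names are hypothetical.

```python
import math
import random

BASES = "ACGT"


def mutate(seq, p_change, rng):
    """One time step: each site changes with probability p_change = 3*alpha,
    moving to one of the three other bases with equal probability."""
    for i, base in enumerate(seq):
        if rng.random() < p_change:
            seq[i] = rng.choice([b for b in BASES if b != base])


def simulate_divergence(length, alpha, steps, seed=1):
    """Evolve two copies of a random ancestor independently; return d per step."""
    rng = random.Random(seed)
    ancestor = [rng.choice(BASES) for _ in range(length)]
    seq1, seq2 = ancestor[:], ancestor[:]
    trajectory = []
    for _ in range(steps):
        mutate(seq1, 3 * alpha, rng)
        mutate(seq2, 3 * alpha, rng)
        trajectory.append(sum(a != b for a, b in zip(seq1, seq2)) / length)
    return trajectory


def jukes_cantor_d(alpha, t):
    """Expected proportion of differing sites, d = 3/4 (1 - exp(-8 alpha t))."""
    return 0.75 * (1.0 - math.exp(-8.0 * alpha * t))


alpha = 0.01  # hypothetical rate, chosen large so that saturation appears quickly
d_sim = simulate_divergence(length=1000, alpha=alpha, steps=200)
for t in (10, 50, 200):
    print(t, d_sim[t - 1], round(jukes_cantor_d(alpha, t), 3))
```

The simulated values fluctuate around the prediction, with sampling noise of the order expected for 1000 sites, and level off near 0.75 once the two sequences have been effectively randomized with respect to each other.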

Figure 8.5 Simulations of the substitution process for nucleotide sequences
show that the sequence divergence saturates at d = 0.75. The jagged lines are
numerical simulations of a sequence of length 1000, and the dots give the
prediction under the Jukes-Cantor model.

PROBLEM 8.3 The coding regions of the trpA genes in strains of the
related enteric bacteria Escherichia coli strain K12 and Salmonella
typhimurium strain LT-2 were sequenced and compared (Nichols and
Yanofsky 1979). The trpA gene codes for one of the subunits of the enzyme
tryptophan synthetase used in the synthesis of tryptophan. Estimate the amount
of nucleotide divergence k and amino acid divergence K and their standard
deviations.

Val Ala Pro Ile Phe Ile Cys Pro Pro Asn Ala Asp Asp Asp Leu Leu Arg Gln Ile Ala

Ile Ala Pro Ile Phe Ile Cys Pro Pro Asn Ala Asp Asp Asp Leu Leu Arg Gln Val Ala

ANSWER For the amino acid sequences, L = 20 and D = 2/20 = 0.10; thus K =
-ln(0.90) = 0.105 with standard deviation 0.074. For the nucleotide sequences,
L = 60 and d = 9/60 = 0.15; thus k = -(3/4) ln(0.8) = 0.167 with standard
deviation 0.058. Assuming that Escherichia and Salmonella diverged at around
the time of the mammalian radiation 80 million years ago, the rates of
evolution are $0.167/(2 \times 80 \times 10^{6}) = 1.04 \times 10^{-9}$
nucleotide substitutions per site per year and
$0.105/(2 \times 80 \times 10^{6}) = 0.66 \times 10^{-9}$
amino acid replacements per site per year. In the gene as a whole, the values
are k = 0.300 for nucleotide substitutions and K = 0.162 for amino acid
replacements.
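The nucleotide part of this answer can be reproduced as follows. This is a sketch with illustrative function names; the variance expression used is the usual large-sample formula for the Jukes-Cantor estimate, Var(k) = d(1 - d)/[L(1 - 4d/3)^2], which reproduces the standard deviation quoted in the answer.

```python
import math


def jukes_cantor_distance(d):
    """k = -(3/4) ln(1 - 4d/3): substitutions per site under Jukes-Cantor."""
    return -0.75 * math.log(1.0 - 4.0 * d / 3.0)


def jukes_cantor_sd(d, L):
    """Square root of Var(k) = d(1 - d) / [L (1 - 4d/3)^2]."""
    return math.sqrt(d * (1.0 - d) / (L * (1.0 - 4.0 * d / 3.0) ** 2))


d, L = 9 / 60, 60  # nucleotide comparison in Problem 8.3
k = jukes_cantor_distance(d)
print(round(k, 3), round(jukes_cantor_sd(d, L), 3))  # 0.167 0.058
print(k / (2 * 80e6))  # substitutions per site per year
```

The estimate is undefined once d reaches 3/4, because the logarithm's argument is then zero or negative; this is the numerical counterpart of the saturation of d.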

The Jukes-Cantor model assumes that all possible nucleotide changes
occur at an equal rate. In fact, it is generally observed from sequence
comparisons that transitions, or changes either from purine to purine (G ↔ A) or
from pyrimidine to pyrimidine (C ↔ T), are more frequent than transversions
(the other possible changes). Kimura (1980a) sought to accommodate this
observation by making a model with two mutation-rate parameters.
Transitions occur with rate $\alpha$ and transversions occur with rate $\beta$.
The rate matrix below shows the parameters of the Kimura two-parameter
model; as you might guess, other models can also be specified by adding
parameters to this table. These models can be fitted to the data in a variety
of ways, including solutions of the sort we derived for the Jukes-Cantor
model, as well as with more complex numerical methods.

Rate matrix for the Kimura two-parameter model:

                      Ending base
Starting base     A     G     C     T
A                 -     α     β     β
G                 α     -     β     β
C                 β     β     -     α
T                 β     β     α     -
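A two-parameter matrix of this kind can be written out programmatically. The sketch below is an illustration under stated assumptions, not the book's table: it treats the entries as per-step change probabilities, places the remainder 1 - α - 2β on the diagonal (by analogy with 1 - 3α in the Jukes-Cantor case), and uses hypothetical numerical rates.

```python
BASES = ("A", "G", "C", "T")
PURINES = {"A", "G"}


def is_transition(x, y):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return x != y and ((x in PURINES) == (y in PURINES))


def kimura_matrix(alpha, beta):
    """Per-step change probabilities: alpha for the transition,
    beta for each of the two transversions, remainder on the diagonal."""
    matrix = {}
    for x in BASES:
        row = {}
        for y in BASES:
            if x == y:
                row[y] = 1.0 - alpha - 2.0 * beta
            elif is_transition(x, y):
                row[y] = alpha
            else:
                row[y] = beta
        matrix[x] = row
    return matrix


P = kimura_matrix(alpha=0.02, beta=0.005)  # hypothetical rates
for x in BASES:
    print(x, [round(P[x][y], 3) for y in BASES])
```

Setting `alpha` equal to `beta` makes all off-diagonal entries identical, which recovers the one-parameter Jukes-Cantor pattern.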

























Usually we have data on more than two sequences, and estimates of 
the parameters of the substitution models are sought. If the phylogeny of 
the organisms in the data set is known, it is possible to calculate the like- 
lihood of the observed sequences given the phylogeny and the parameters 
in the model (Felsenstein 1981). Many advances have been made in recent 
years in applying the method of maximum likelihood for estimating para- 
meters of the substitution process in this context (Goldman 1993; Yang 

Other Measures of Molecular Divergence 

Rates of evolution and divergence times can be estimated from other kinds 
of molecular data if care is taken to consider carefully how the process of 
mutations results in differences in the data that are actually scored. For 
example, Randomly Amplified Polymorphic DNA (RAPD), which is ana- 
lyzed by the polymerase chain reaction, can be used to estimate nucleotide 
divergence only if one can verify some questionable assumptions about how 
PCR reactions work (Clark and Lanigan 1993). VNTR loci (loci that are poly- 
morphic due to variable numbers of tandem repeats generated by unequal 
exchanges) have a forward and back mutation pattern that results in very
different population dynamics. In a similar manner, microsatellites, also
known as STRPs (short tandem repeat polymorphisms), undergo increases
and decreases in copy number such that small changes in copy number are 
more common than large changes. The result is a sort of stepwise mutation 
process, and models of this have yielded predictions about patterns of 
microsatellite variability that are roughly concordant with observations 
(Zhivotovsky and Feldman 1995). 


Although the rate of nucleotide substitution and amino acid replacement 
varies among different genes, the average rate of molecular evolution can be 
rather uniform throughout long periods of evolutionary time. Such unifor- 
mity in the rate of amino acid replacement or nucleotide substitution, first 
noted by Zuckerkandl and Pauling (1962), is known as a molecular clock. 

An example of the approximate uniformity in amino acid substitutions is
illustrated in the evolution of the α-globin gene in the organisms depicted in
the phylogenetic tree in Figure 8.6. The data are summarized in Table 8.1. The
numbers above the diagonal are the percent amino acid differences
(D × 100) between the α-globin sequences. For example, the α-globin genes of
dog and human differ in 16.3% of their amino acid sites; since mammalian
α-globin contains 141 amino acids, this percentage corresponds to 23 sites in
which the amino acids differ. The percentages exclude differences that result
from the insertion or deletion of amino acids, which are called gaps in
sequence comparisons. For example, the comparison between human and
shark α-globin is based on 139 amino acid sites that are homologous, and
excludes gaps amounting to 11 additional amino acid sites. Missing from
Figure 8.6 are plants, which (remarkably) also have sequences, known as
leghemoglobin, that show significant homology to vertebrate globins
(Landsmann et al. 1986).

Beneath the diagonal in Table 8.1 are the estimated proportions of
differences per amino acid site, calculated from Equation 8.5 as K = -ln(1 - D).
The table also gives the average value of K in all comparisons with the shark, 
carp, newt, chicken, echidna, kangaroo, and dog, respectively, and the diver- 
gence times from the bifurcations in Figure 8.6. 

The average proportion of differences per site is plotted against
divergence time in Figure 8.7. The very close fit to a straight line is evident.
Since the divergence time is exactly half of the total time available for
evolution (Figure 8.6), the rate of evolution $\lambda$ can be estimated as
one-half times the slope of the line in Figure 8.7. For these data, the slope is
$1.8 \times 10^{-9}$, and therefore $\lambda = 0.9 \times 10^{-9}$ amino acid
replacements per amino acid site per year. The good fit of the points to the
straight line indicates that the actual rate of α-globin evolution has deviated
little from the average for the past 450 million years.
Figure 8.6 Phylogenetic relationships among eight vertebrate species and
their approximate times of evolutionary divergence. (From Kimura 1983.)

[Table 8.1. Pairwise comparisons of α-globin among shark, carp, newt,
chicken, echidna, kangaroo (Kang), dog, and human. Surviving entries:
59.4, 59.7, 48.6, 44.7, 47.5, 24.8.]

(Percentage data from Kimura 1983.)

Note: Values above the diagonal are the observed percent amino acid differences
(D × 100) between the α-globin sequences in the species; values in boldface are
the expected amino acid differences per site [K = -ln(1 - D)]. Average values of
K and the estimated times of divergence (in millions of years) are given at the
bottom of the table. Abbreviation: Kang, kangaroo.

Figure 8.7 Relation between estimated number of amino acid substitutions in
α-globin (K) between pairs of the vertebrate species in Figure 8.6, against time
since each pair diverged from a common ancestor. The straight line is expected
based on a uniform rate of amino acid substitution during the entire period.
(From Kimura 1983.)

PROBLEM 8.4 The β-globin molecule in primates contains 146
amino acids, and estimates of the number of amino acid differences
among various primates are tabulated below (data from Kimura
1983). Calculate the average rate of evolution of the β-globin molecule in
primates. (Hint: First calculate D and K for each species pair, then plot
the points with time on the x axis and K on the y axis. Finally, do a
linear regression to estimate the average rate of substitution.)

Time of divergence (millions of years)    Average number of amino acid differences













ANSWER D values are obtained by dividing each number of amino
acid differences by 146, and average values of K are estimated as
-ln(1 - D). The average K values, from top to bottom, are 0.192, 0.180,
0.044, 0.042, 0.018, 0.007, respectively. These are the y values in the
linear regression, and the x values are the divergence times. Altogether
there are n = 6 points. In this case, $\sum xy = 3.1263 \times 10^{7}$,
$\sum x = 2.72 \times 10^{8}$, $\sum y = 0.482$, and
$\sum x^{2} = 1.5314 \times 10^{16}$. The slope of the regression
is $3.15 \times 10^{-9}$, and the rate of evolution is half of this, or
$1.58 \times 10^{-9}$ amino acid replacements per amino acid site per year.
This estimate is reasonably close to the value of $0.9 \times 10^{-9}$ per
year calculated for α-globin. (Note: Rather than calculate K from the average
number of amino acid differences, it would be more accurate to calculate K for
each species comparison and then take the average; however, in this
example, it makes very little difference.)
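The regression slope in this answer follows from the summary sums alone. The sketch below uses the standard least-squares formula; the function name is illustrative, and the sums are those quoted in the answer.

```python
def regression_slope(n, sum_x, sum_y, sum_xy, sum_x2):
    """Least-squares slope computed from summary sums."""
    return (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)


slope = regression_slope(n=6, sum_x=2.72e8, sum_y=0.482,
                         sum_xy=3.1263e7, sum_x2=1.5314e16)
rate = slope / 2  # divergence time is half of the total time separating two species
print(slope, rate)
```

The slope comes out near 3.15 x 10^-9 and the rate near 1.58 x 10^-9 per site per year, in agreement with the values in the answer.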

Variation across Genes in the Rate of the Molecular Clock 

If an organism has a particular rate of mutation in its genome, one might 
think at first that the rate at which the molecular clock runs would be the 
same for all genes. But the neutral theory predicts that the rate of molecular 
evolution should depend on the neutral mutation rate, which may be quite a 
bit lower than the overall mutation rate, and may vary widely across genes. 
Figure 8.8 shows that three different proteins in the same organisms have
widely differing molecular clock rates. Nevertheless, within each gene, we
observe reasonably uniform rates of change. The variation across genes
appears to be due to the fact that some proteins are highly tolerant of
substitutions, whereas others suffer deleterious effects from even one or a few
minor changes. Genes whose function is well buffered from the environment
generally have a slower rate of substitution than genes whose products have
a premium on variability. The extremes are represented by histone H4, at the
low end, and γ-interferon, at the high end, with globin proteins near the
middle of the spectrum. In short, the molecular clocks for different genes "tick"
at different rates.

Figure 8.8 The molecular clock runs at different rates in different proteins.
One reason is that the neutral substitution rate differs among proteins.
Fibrinogen appears to be relatively unconstrained and has a high neutral
substitution rate, while cytochrome c has a lower neutral substitution rate, and
may be more constrained. Data are from a wide variety of organisms. (From
Dickerson 1971.)

In addition to functional constraints affecting substitution rate, the pat- 
tern of hereditary transmission also affects substitution rate. Organelle 
genomes are replicated and transmitted in a manner distinct from nuclear 
genes, so it may not be surprising that they undergo substitutions with dif- 
ferent dynamics. Mitochondrial DNA exhibits wide variation in substitution 
rates across its relatively tiny genome, but in animals the substitution rate is 
generally much higher than the substitution rate of chromosomal genes. In 
plants, on the other hand, comparisons among nucleotide substitution rates 
of nuclear DNA, chloroplasts, and mitochondria reveal clear differences, with 
mtDNA showing less than one-third the substitution rate of chloroplast 
DNA, which in turn has about half the substitution rate of nuclear genes 
(Wolfe et al. 1987). In general, genes on the X chromosome have a lower rate 
of substitution than do genes on autosomes (Miyata et al. 1987). A higher rate 
of mutation in males (Shimmin et al. 1993) would lower the X-chromosome 
rate because the X chromosome spends more time in females. But the substi- 
tution rate for Y-linked genes is indistinguishable from that of autosomes 
(McVean and Hurst 1997), which suggests that the mutation rate is equal in 
both sexes but lower in X-linked genes than in autosomal genes. 

Not only do substitution rates vary from one gene to another, but they 
also vary widely across sites within each gene! If all sites did undergo sub- 
stitution at the same rate, then the number of substitutions per site should 
have a Poisson distribution. Fitch and Margoliash (1967) noticed that the
cytochrome c data did not fit this model unless invariant and hypervariable
sites were excluded. The models that we have developed so far assume that
all sites evolve in the same way, so to accommodate this variability (and to
test for how different the rates are) models that specifically incorporate rate
variation must be developed. One convenient model is to assume that the
rates vary according to a gamma distribution (Golding 1983; Wakeley 1993).
Yang (1996b) reviews estimates of the rate-variation parameter of the gamma 
distribution, and finds that all 17 cases examined show significant among- 
site variation in substitution rate. 

Variation across Lineages in Clock Rate 

The neutral theory predicts that the molecular clock should run at
different rates for different organisms having different neutral mutation
rates. The range of mutation rates is impressive. Figure 8.9 shows the
number of nucleotide differences observed in the influenza NS genes, plotted
against the year of isolation of the virus containing them. The rate of gene
substitution averages $\lambda = (1.94 \pm 0.09) \times 10^{-3}$ nucleotide
substitutions per nucleotide site per year. Although the rate of gene
substitution is about $10^{6}$-fold faster than observed in germline genes in
eukaryotes, it is nevertheless approximately constant during the period
available for study. The extraordinary rate of evolution in influenza virus is
thought to be related to a high rate of spontaneous mutation resulting from
errors in replication (Holland et al. 1982). As in many other RNA-based
viruses, the RNA replicase enzyme that replicates the influenza genome lacks a
proofreading function. Rapid rates of gene substitution can be of immense
medical significance. Yokoyama et al. (1988) estimated the rate of substitution
in the pol gene of the human immunodeficiency virus as
$0.5 \times 10^{-3}$ per nucleotide site per year. The time
of divergence between HIV1 and HIV2 was estimated at just 200 years ago,
and the bulk of the genetic variability among recently isolated strains of
HIV1 has been generated in the last 20 years.

Figure 8.9 Molecular evolution in the NS genes of influenza virus determined
from strains isolated and stored during the past 60 years. The total rate of
evolution in the 890-nucleotide sequence averages 1.73 ± 0.08 nucleotide
substitutions per year, and the rate is remarkably uniform. (From Buonagurio
et al. 1986.)
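The per-site rate quoted for influenza follows from the whole-sequence rate in the caption of Figure 8.9. The sketch below shows the conversion; the comparison value of 1e-9 per site per year is an assumption used only as a rough stand-in for eukaryotic nuclear genes, taken from the order of magnitude quoted earlier for an average protein.

```python
SEQUENCE_LENGTH = 890  # nucleotides in the NS sequence (Figure 8.9)
total_rate, total_se = 1.73, 0.08  # substitutions per year, whole sequence

per_site_rate = total_rate / SEQUENCE_LENGTH
per_site_se = total_se / SEQUENCE_LENGTH
print(per_site_rate, per_site_se)

# Rough comparison with an assumed eukaryotic rate of 1e-9 per site per year
fold_difference = per_site_rate / 1e-9
print(fold_difference)
```

The result is about 1.94 x 10^-3 substitutions per site per year, roughly a million times the assumed eukaryotic value.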

The rate of the molecular clock also varies among taxonomic groups
(Britten 1986). For example, the insulin gene evolved much more rapidly in the
evolutionary line leading to the guinea pig than in other evolutionary lines
(King and Jukes 1969), and the C-type viral sequences integrated into the
primate genome evolved at twice the rate in Asian primates as in African
primates (Benveniste 1985). Figure 8.10 illustrates another example of a
retardation in the clock in one lineage. Such departures from constancy of the
clock rate pose a problem in using molecular divergence to date the times of
existence of most recent common ancestors. Before this inference can be
justified, one needs to know that the set of species one is examining has a
uniform clock.

Figure 8.10 Gene genealogy of Drosophila Adh sequences showing a
significant slow-down of substitutions in the pseudoobscura clade. (After
Takezaki et al.

PROBLEM 8.5 The simplest way to test whether substitutions have
occurred at the same rate in different organisms is to consider a tree
like that in Figure 8.11. We expect that the divergence between A and
C should be the same as the divergence between B and C if the clock
is uniform on all branches. Tests of this hypothesis are known as
relative rate tests. Any site that underwent a substitution along the
branch from X to C (but not on the other branches) will have the
property that A = B ≠ C. Sites that underwent a substitution on the branch
from X to B (but not the other branches) will show A = C ≠ B. Tajima
(1993) showed that a simple and robust relative rate test could be
performed by simply doing a chi-square test of the null hypothesis that
the numbers of these two kinds of sites are equal. Suppose we observe
sequences as follows:

Calculate the observed and expected numbers of sites in the two
categories (A = B ≠ C and A = C ≠ B), and calculate the chi-square statistic
to determine whether they are equal.

Figure 8.11 A simple tree for illustrating the relative rate test of Tajima (1993).

ANSWER The observed number of sites for which A = B ≠ C is 2, and
for A = C ≠ B there are 3 sites. Sites where A = B = C or A ≠ B ≠ C are
ignored in this test. The expected number of sites of the two types is
each (2 + 3)/2, so the chi-square test gives
$(2 - 2.5)^{2}/2.5 + (3 - 2.5)^{2}/2.5 = 0.2$, which is clearly not
significant. This example had insufficient data for an adequate test, but it
provides an example in which there is no evidence for significant difference
in rates. A more flexible but more involved test, based on maximum
likelihood, can be found in Muse and Weir (1992).
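The counting and the chi-square calculation can be sketched in a few lines of Python. The three sequences below are hypothetical, constructed only so that the counts match the 2 and 3 used in the answer; the function name is an illustrative choice.

```python
def relative_rate_test(seq_a, seq_b, seq_c):
    """Count sites with A = B != C and with A = C != B, and return both
    counts together with the chi-square statistic (1 degree of freedom)."""
    sites = list(zip(seq_a, seq_b, seq_c))
    n_ab = sum(a == b != c for a, b, c in sites)  # A = B, C differs
    n_ac = sum(a == c != b for a, b, c in sites)  # A = C, B differs
    if n_ab + n_ac == 0:
        raise ValueError("no informative sites")
    expected = (n_ab + n_ac) / 2
    chi2 = ((n_ab - expected) ** 2 + (n_ac - expected) ** 2) / expected
    return n_ab, n_ac, chi2


# Hypothetical aligned sequences (not the data of the problem)
seq_a = "ACGTACGTAC"
seq_b = "CATTACGTAC"
seq_c = "ACGTACGTGT"
print(relative_rate_test(seq_a, seq_b, seq_c))
```

With one degree of freedom, the statistic would have to exceed 3.84 to be significant at the 5 percent level, so a value of 0.2 gives no evidence of unequal rates.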

The Generation-Time Effect

One observed feature of molecular evolutionary clocks is that their rate is
approximately constant in a time scale measured in years. This is quite
unexpected because mutation rates are thought to be more nearly constant when
measured in generations. However, the appropriate time scale of molecular
evolution is not completely settled (Easteal 1985), as there is some evidence
that the rate of synonymous substitution in genes in the rodent lineage (short
generation time) might be about two times as rapid as occurs in the same
genes in the human lineage (Wu and Li 1985; Li and Wu 1987). Evidence
from immunoglobulin genes further suggests that among mammals, the
primate lineage has the slowest rate of nucleotide substitution (Sakoyama et al.

Even if true, a nearly constant rate of gene substitution per year is not
necessarily in conflict with a constant rate of neutral mutation per generation.
The reason is that organisms with short generation times tend to be small
and to maintain large population sizes. In such organisms, the proportion of
nearly neutral mutations will be reduced because effective neutrality requires
that Ns ≪ 1, where s is the selection coefficient against the mutation.
However, the smaller proportion of nearly neutral mutations in these organisms is
offset against the occurrence of more mutations per unit time than in larger
organisms, because the generation time is shorter. Thus, the effects of short
generation time and larger population size act in opposite directions and
tend to cancel out (Crow 1985).

Does the Constancy of Substitution Rates Prove the Neutral Theory? 

The possibility that gene substitutions might occur at an approximately
constant rate gave some credence to the simplest version of the neutral theory.
Theoretical principle 2, discussed earlier in this chapter, states that the 
expected rate of substitution of neutral alleles equals the rate of mutation u 
to neutral alleles. Therefore, on the face of it, the occurrence of molecular 
clocks would seem to support the neutral theory. But when we dig a bit 
deeper into the predictions of the molecular clock, we find that things are not 
necessarily so simple. 

In a theoretically perfect molecular clock driven by a random process 
identical to that of radioactive decay (a Poisson process), the variance in the 
rate of ticking would be equal to the average rate of ticking. Tests based on 
the number of substitutions between pairs of species in three proteins 
showed that the variance was significantly larger than the mean (Ohta and 
Kimura 1971). Langley and Fitch (1974) backed this up by an analysis in 
which they estimated the number of substitutions on each branch of the phy- 
logenetic tree, and compared the mean and variance of these counts for each 
branch. Again, there was a highly significant excess variance Gillespie (1989) 
examined the ratio R of the variance to the mean number of substitutions in a 
set of four nuclear and five mitochondrial genes in mammals, and found that 
R ranged from 0.16 to 35.55. (The value of 35.55 is for cytochrome oxidase II, 
which shows 65 amino acid differences between human and mouse, 61 dif- 
ferences between human and cow, and only 21 differences between mouse 
and cow.) Gillespie argued that the large range of R implies a sixfold difference
among mammalian lineages in rates of nucleotide substitution. This excess
variance in substitution rate has been called an "episodic clock," characterized
by periods of stasis alternating with periods of rapid substitution.
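
The variance-to-mean ratio R discussed above can be computed directly from a list of substitution counts. The following is a minimal sketch only: the counts are hypothetical, the function name is an illustration, and the published analyses estimate branch-specific counts (and weight them) in ways that this simple ratio does not attempt to reproduce.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Ratio of the sample variance to the mean of substitution counts.

    For a Poisson process the expected value of this ratio is close to 1;
    values well above 1 indicate excess variance.
    """
    return variance(counts) / mean(counts)

# Hypothetical numbers of substitutions on five lineages (not real data).
counts = [12, 15, 9, 14, 10]
R = dispersion_index(counts)
print(round(R, 3))
```

A ratio near 1 is what a simple Poisson clock predicts; the mammalian values quoted above range far outside that expectation.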

Why does the clock appear to be episodic? One possible reason is that the 
substitution process is not really a simple Poisson process. If instead the rate
itself changes in a random or stochastic manner, the data could be fitted 
much better. Such a process, where the substitution rate for a Poisson process 
is itself stochastic, is called a doubly stochastic process, and it does indeed 
seem to fit the data better (Gillespie 1991). Such a compound Poisson process 
ought to show clusters of rapid change separated by periods of relative qui- 
escence, a pattern that is generally supported by the data (Gingerich 1986; 
Gillespie 1989, 1991). One means of causing variation in the substitution rate
is natural selection in a stochastically varying environment, and such models 
can also fit the data satisfactorily (Gillespie 1986). Takahata (1987) has argued 
that the variance can be inflated by a "fluctuating neutral space" model, in 
which changes in selective constraints among lineages result in variation in 
substitution rate among lineages. The dynamics of substitutions are suffi- 
ciently complicated that a wide range of models can fit the data, but for now, 
one thing we are sure of is that the simplest Poisson process is not adequate. 


We have now seen several examples illustrating the general principle that 
nucleotide substitutions occur at a greater rate than amino acid replace- 
ments. The difference in rates, sometimes much greater than in these data, 
results from redundancy in the genetic code. As illustrated in Table 8.2, the 
codons for eight amino acids contain N (standing for any nucleotide) in their 
third position, seven terminate in Y (any pyrimidine, which means T or C), 
and five terminate in R (any purine, which means A or G). Coding sites con- 
taining an N are called fourfold degenerate sites because any of the four 
nucleotides will do, and those containing a Y or R are twofold degenerate 
sites (Li et al. 1985). Because of degeneracies, nucleotides in a gene can 
change without affecting the amino acid sequence. These changes are called 
synonymous or silent nucleotide substitutions. Nucleotide substitutions that 
do change amino acids are nonsynonymous substitutions. 

Calculating Synonymous and Nonsynonymous Substitution Rates 

Table 8.2

                          Second nucleotide in codon
First
nucleotide      T            C            A            G

    T           TTY Phe      TCN Ser      TAY Tyr      TGY Cys
                TTR Leu                   TAR Stop     TGA Stop
                                                       TGG Trp

    C           CTN Leu      CCN Pro      CAY His      CGN Arg
                                          CAR Gln

    A           ATH Ile      ACN Thr      AAY Asn      AGY Ser
                ATG Met                   AAR Lys      AGR Arg

    G           GTN Val      GCN Ala      GAY Asp      GGN Gly
                                          GAR Glu

Note: In this representation of the standard genetic code, the symbol N stands for any
nucleotide (T, C, A, or G), the symbol Y for any pyrimidine (T or C), and the symbol R for any
purine (A or G). The H in the set of codons for isoleucine (Ile) stands for "not-G" (T, C, or A).
Degeneracies are as follows: N represents a fourfold degenerate site; Y and R represent
twofold degenerate sites. The H in the set of codons for isoleucine is considered as twofold
degenerate, as are the first nucleotides in four leucine codons (TTA, TTG, CTA, and CTG) and
four arginine codons (CGA, CGG, AGA, and AGG). All other nucleotides are nondegenerate.

In calculations involving synonymous and nonsynonymous nucleotide sites,
the total number of synonymous sites is calculated as the number of fourfold
degenerate sites plus one-third of the number of twofold degenerate sites.
The total number of nonsynonymous sites in a coding region is defined as
the number of nondegenerate sites (nucleotides in which any change results
in an amino acid substitution), plus two-thirds of the number of twofold
degenerate sites (the latter because, with random mutation at twofold
degenerate sites, two-thirds of the mutations are expected to result in amino
acid changes). These conventions are illustrated above.
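
The weighting convention just described can be written as a short Python sketch. The function name and argument names are illustrative assumptions; the example counts (38 nondegenerate, 12 twofold, and 10 fourfold degenerate sites) are the ones reported for the trpA region in Problem 8.6.

```python
from fractions import Fraction

def count_sites(nondegenerate, twofold, fourfold):
    """Return (synonymous_sites, nonsynonymous_sites).

    Fourfold sites count fully as synonymous, nondegenerate sites count
    fully as nonsynonymous, and each twofold site is split 1/3 : 2/3.
    """
    synonymous = fourfold + Fraction(1, 3) * twofold
    nonsynonymous = nondegenerate + Fraction(2, 3) * twofold
    return synonymous, nonsynonymous

# Site counts quoted in the answer to Problem 8.6.
syn, nonsyn = count_sites(nondegenerate=38, twofold=12, fourfold=10)
print(syn, nonsyn)  # 14 46
```

Exact fractions are used so that the one-third and two-thirds weights do not introduce rounding error; the two totals always sum to the number of sites examined.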

PROBLEM 8.6 For the sequences of the region of the trpA gene 
given earlier, calculate the synonymous and nonsynonymous substi- 
tution rates. Start by using Table 8.2 to assign degeneracy classes to 
each site. For each difference between E. coli and Salmonella, the dif- 
ference is synonymous either if the site is fourfold degenerate or if it 
is twofold degenerate and the change is a transition (that is, A to G or 
the reverse, or T to C or the reverse). The difference is nonsynony- 
mous either if the site is nondegenerate or if it is twofold degenerate
and the change is a transversion (that is, A or G to T or C). Equation 8.15 is used to 
estimate the proportion of nonsynonymous nucleotide substitutions per nonsyn- 
onymous site and the proportion of synonymous substitutions per synonymous 
site. The degeneracy assignments are therefore as follows: 

004 004 004 002 002 002 002 004 004 002 004 002 002 002 204 204 004 002 002 004
Val Ala Pro Ile Phe Ile Cys Pro Pro Asn Ala Asp Asp Asp Leu Leu Arg Gln Ile Ala

Ile Ala Pro Ile Phe Ile Cys Pro Pro Asn Ala Asp Asp Asp Leu Leu Arg Gln Val Ala

ANSWER The stars above indicate differences with Salmonella and the letters 
below indicate which changes are nonsynonymous (N) and which are synonymous 
(S). Altogether there are 38 nondegenerate sites, 12 twofold degenerate sites, and 10
fourfold degenerate sites. The total number of nonsynonymous sites is 38 + (2/3)12 
= 46, and the total number of synonymous sites is 10 + (1/3)12 = 14. There are three 
nonsynonymous changes (D = 3/46 = 0.065) and six synonymous changes (D = 
6/14 = 0.429). Now we use Equation 8.15 to estimate the proportion of nonsynony- 
mous nucleotide substitutions per nonsynonymous site and the proportion of syn- 
onymous substitutions per synonymous site. The number of nonsynonymous 
nucleotide substitutions per nonsynonymous site is k = 0.068, and the number 
of synonymous nucleotide substitutions per synonymous site is k = 0.635.
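
The arithmetic in this answer can be checked with a few lines of Python. Equation 8.15 is not reproduced in this section, so the sketch below assumes it is the Jukes-Cantor correction for multiple substitutions, k = -(3/4) ln(1 - 4D/3); that assumption does reproduce the two values quoted in the answer. The function and variable names are illustrative.

```python
import math

def jukes_cantor(d):
    """Convert an observed proportion of differences d into an estimated
    number of substitutions per site (assumed form of Equation 8.15)."""
    return -0.75 * math.log(1.0 - 4.0 * d / 3.0)

d_nonsyn = 3 / 46   # three nonsynonymous changes over 46 nonsynonymous sites
d_syn = 6 / 14      # six synonymous changes over 14 synonymous sites

k_nonsyn = jukes_cantor(d_nonsyn)
k_syn = jukes_cantor(d_syn)
print(round(k_nonsyn, 3), round(k_syn, 3))  # 0.068 0.635
```

The correction is small when few sites differ (0.065 becomes 0.068) but substantial when many do (0.429 becomes 0.635), because multiple substitutions at the same site become more likely.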

Estimates of synonymous and nonsynonymous substitution rates for a 
mammalian protein-encoding gene are plotted in Figure 8.12. A striking 
observation is that the synonymous rates are generally much greater than 
the rates of substitution at nonsynonymous sites. These rates are scaled, so 
that if all mutations were equally likely to go to fixation, the rates would be 
equal. The depression in nonsynonymous substitution rate is interpreted as 
being caused by natural selection eliminating those changes that are delete- 
rious. There also appears to be greater variability of nonsynonymous rates 
than there is in the synonymous rates, although even the latter vary by more
than twofold. Figure 8.13 shows that the two rates are correlated, suggesting
that either the mutation rates vary from gene to gene or that the constraints 
on nonsynonymous sites are somehow correlated with those on synony- 
mous sites. We shall see how this correlation might arise at the end of this 

Figure 8.12 Synonymous sites and nonsynonymous sites in β-globin undergo
substitutions at different rates, but to a first approximation, both may appear to
exhibit a clocklike substitution process. The horizontal axis is divergence time.
(From Li et al. 1985a.)

One problem that may be apparent with the above method for counting 
synonymous and nonsynonymous sites is that the status of a particular site 
may change during evolution. The reason is that changes elsewhere in the 
codon may make a site that was formerly fourfold degenerate now become
twofold degenerate. In fact, the way the sites are tallied depends on the
order in which they are considered. Another way to calculate nonsynony- 
mous and synonymous substitution rates is to consider each codon and 
count the number of changes that occurred. For codons that changed at a single
site, the change is scored as synonymous if there was no alteration in the

Figure 8.13 Plotting the data of Figure 8.12 in another way, the relative rates
of synonymous and nonsynonymous substitutions vary somewhat, but in all
cases nonsynonymous rates are lower. The horizontal axis is the synonymous
rate. (Data from Li et al. 1985a.)

resulting amino acid sequence, and nonsynonymous if there was an alter- 
ation. When there are two differences in a codon, then it is necessary to con- 
sider both orders of occurrence, and if we have no reason to assume one 
order is more likely, then both are considered equally likely. The two orders 
may have differing numbers of synonymous and nonsynonymous changes. 
For example, if a codon changes from CCG (proline) to AGG (arginine), it
could have done so either through CCG → ACG (threonine) → AGG or
through CCG → CGG (arginine) → AGG. The first possibility entails two
nonsynonymous changes, whereas the second entails only one. If there are three
changes in a codon, there are six possible orders in which they might have 
occurred. This all-possibilities method, by Nei and Gojobori (1986), seems 
like an improvement, but actually the estimates come out to be very similar 
to the method in Problem 8.6. Furthermore, even this method does not avoid 
the problems of sites changing status due to flanking changes. Far more com- 
plicated models are needed to fully avoid this problem, but in the end, the 
estimates that they give are also very similar to the simplest method outlined 
in Problem 8.6 (Muse and Gaut 1994; Goldman and Yang 1994). 
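
The averaging over orders of occurrence described above can be illustrated with a short Python sketch. It is a simplified illustration rather than the published procedure: every order is weighted equally, and pathways that pass through a stop codon receive no special treatment, which is a simplifying assumption. The names `CODE` and `pathway_counts` are illustrative.

```python
from itertools import permutations

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def pathway_counts(start, end):
    """Average (synonymous, nonsynonymous) changes between two codons,
    taken over every order in which the differing sites could change."""
    sites = [i for i in range(3) if start[i] != end[i]]
    orders = list(permutations(sites))
    syn_total = nonsyn_total = 0
    for order in orders:
        codon = start
        for pos in order:
            nxt = codon[:pos] + end[pos] + codon[pos + 1:]
            if CODE[nxt] == CODE[codon]:
                syn_total += 1
            else:
                nonsyn_total += 1
            codon = nxt
    return syn_total / len(orders), nonsyn_total / len(orders)

print(pathway_counts("CCG", "AGG"))  # (0.5, 1.5)
```

For the CCG to AGG example, one order gives two nonsynonymous changes and the other gives one synonymous and one nonsynonymous change, so the average is 0.5 synonymous and 1.5 nonsynonymous changes.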

Paralleling the evolutionary rates for amino-acid-changing substitutions, 
the rates of nonsynonymous nucleotide substitution vary tremendously 
among different proteins. Among the slowest rates is that of histone H4, for
which k = 0.004 × 10⁻⁹ substitutions per nonsynonymous nucleotide site per
year, and among the fastest is that of γ-interferon, for which k = 2.80 × 10⁻⁹
substitutions per nonsynonymous nucleotide site per year. The average rate
among a large number of proteins is very close to the rate found in hemoglobin,
which is 0.87 × 10⁻⁹ substitutions per nonsynonymous nucleotide site per
year (Figure 8.14). As in the examples given here, rates of nonsynonymous 
nucleotide substitution are usually quite similar to the rates of amino acid 
replacement in the same genes. 

In contrast with the highly variable rates of nonsynonymous nucleotide 
substitutions among proteins, the rates of synonymous substitution are much 
more uniform. For example, in mammalian genes, the fastest rate of synonymous
substitution is only 3 to 4 times greater than the slowest rate (see Figure
8.14). However, the average rate, k = 4.7 × 10⁻⁹ substitutions per synonymous
site per year, is not only greater than the average rate of nonsynonymous
substitutions, but it is greater than the fastest known rate of nonsynonymous
substitutions (for γ-interferon).

Figure 8.14 Comparison of rates of synonymous and nonsynonymous nucleotide
substitutions. Synonymous rates are generally much faster and much
more uniform than nonsynonymous rates. Both rates are shown in units of
10⁻⁹ per year; the genes labeled include α-globin, histone H3, interferon β,
and growth hormone. (From Kimura 1986.)

The great variability among proteins in the rate of nonsynonymous nucle- 
otide substitution, when contrasted with the much smaller variability found 
in the rate of synonymous substitutions, is illustrated graphically in Figure
8.14. This disparity has been used as evidence in favor of the neutral theory.
Interpreted according to the neutral theory, the variation in rates occurs
because there are selective constraints on amino acid substitutions that do not
operate as strongly on synonymous nucleotide substitutions. Not just any
amino acid will serve at a particular position in a protein molecule, because 
each amino acid must participate in the chemical interactions that fold the 
molecule into its three-dimensional shape and give the molecule its speci- 
ficity and ability to function. The need for proper chemical interactions and 
folding constrains the acceptable amino acids that can occupy each site. 
Although some amino acid replacements may be functionally equivalent or 
nearly equivalent, many more are expected to impair protein function to such 
an extent that they reduce the fitness of the organisms that contain them. 
Thus, the constraints on acceptable amino acids are selective constraints 
because unacceptable amino acid replacements are eliminated by selection. 

If an amino acid replacement does occur, its effect on the function of the 
protein product will depend on many factors, but one of the most important 
determinants of protein conformation is the charge of the amino acid. Differ- 
ent amino acid replacements give different numbers of charge changes, and 
in most cases the smallest change in charge might be expected to result in the 
smallest conformational change. Peetz et al. (1986) examined the charge 
changes in the evolution of seven proteins, and found that hemoglobin α,
hemoglobin β, myoglobin, and insulin all accumulated charge changes at a
rate slower than expected by random substitution. This finding is consistent 
with constraints on the conformation of these proteins that limit permissible 
charge changes. On the other hand, cytochrome c and fibrinogens A and B 
accumulate charge changes at the expected neutral rate. 

For comparison of rates, it would be useful to study rates of nucleotide 
substitution in stretches of DNA wholly devoid of function and therefore 
subject exclusively to the whims of mutation and random drift. A likely can- 
didate is found in a class of genes called pseudogenes, which are DNA 
sequences that are homologous to known genes but that have undergone one 
or more mutations eliminating their ability to be expressed. Pseudogenes are 
thought to be completely nonfunctional relics of mutational inactivation, 
and, in fact, their extremely rapid rate of nucleotide substitution is offered in 
support of this view. The average rate of nucleotide substitution in pseudo- 
genes is faster than the average rate found in intervening sequences, flanking 
regions, and fourfold degenerate (synonymous) sites. Pseudogenes evolve at 
the fastest rates known, which may correspond to rates of substitution when 
DNA is completely unconstrained by natural selection. The fact that fourfold 
degenerate sites evolve more slowly than pseudogenes may be a suggestion
that these sites are not totally lacking in constraint, an idea we shall return to.

Rates of nucleotide substitution also vary within protein molecules. 
Human insulin is a good illustration. The A and B polypeptide chains found 
in the mature insulin molecule are created by post-translational cleavage of a 
longer polypeptide known as preproinsulin. Preproinsulin contains a signal 
peptide for secretion and an internal C-peptide, neither of which is present
in the active molecule. The rates of nucleotide substitution in these three
regions are 0.16 for the A and B chains, 0.99 for the C peptide, and 1.16 for the
signal peptide. [As in Li et al. (1985), rates are expressed in terms of
nonsynonymous nucleotide substitutions per nonsynonymous site per billion
years.] In insulin, while there is a sevenfold difference between the maximum
and minimum rates of nonsynonymous substitution in different regions of
the molecule, the rates of synonymous substitution differ only twofold.
Moreover, there is a negative correlation between functional importance and
rate of nonsynonymous substitution within the insulin molecule. Many
diverse amino acid sequences can serve as signal peptides provided they are
hydrophobic, which suggests that selective constraints on signal peptides
may be reduced in comparison with sequences in mature polypeptides. In
insulin, as expected, the rate of nonsynonymous substitution is fastest in the 
signal peptide and slowest in the functional subunits of the mature molecule. 
This kind of negative correlation between selective constraint and substitu- 
tion rate has also been observed in several other proteins (Li et al. 1985). 

Within-Species Polymorphism

So far we have talked only about differences between nucleotide sequences of 
genes from distinct species. DNA sequence differences between alternative alleles
of the same gene in a single species may also be synonymous or nonsynonymous,
and it is instructive to compare levels of within-species polymorphism
at synonymous vs. nonsynonymous sites. In this case we do not generally
talk about substitution rate, but rather quantify the variability with the
nucleotide diversity. Nucleotide diversity, often symbolized with the Greek letter π,
is the probability that a sample of a particular nucleotide site drawn from two 
individuals will differ. It is essentially the heterozygosity at the nucleotide level. 
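
As a concrete illustration of this definition, nucleotide diversity can be estimated as the average proportion of sites that differ over all pairs of sequences in a sample. The sketch below is a minimal version that assumes the sequences are already aligned, are of equal length, and contain no gaps; the three short sequences are made up for illustration and are not data from any study.

```python
from itertools import combinations

def nucleotide_diversity(sequences):
    """Average proportion of differing sites over all pairs of aligned
    sequences (an estimate of the nucleotide diversity)."""
    length = len(sequences[0])
    pairs = list(combinations(sequences, 2))
    total_differences = 0
    for a, b in pairs:
        total_differences += sum(1 for x, y in zip(a, b) if x != y)
    return total_differences / (len(pairs) * length)

# Three hypothetical aligned alleles, ten sites each.
sample = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAC"]
pi_hat = nucleotide_diversity(sample)
print(round(pi_hat, 4))
```

Here the three pairs differ at 1, 1, and 2 sites, so the estimate is 4 differences over 30 pairwise site comparisons.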
Figure 8.15 illustrates the first systematic study of DNA sequence variation
in a set of 11 alcohol dehydrogenase alleles of Drosophila melanogaster (Kreitman
1983). Of the 2659 nucleotides sequenced, 52 were variable across the 11 alleles.
The nucleotide diversity over the entire gene was 0.0065 ± 0.0017, meaning that
99.4% of the time, pairs of alleles will match at a site. The level of nucleotide
diversity differs in different regions of genes. Figure 8.16 illustrates the
estimates of nucleotide diversity found in different parts of the Drosophila Adh
gene. The different parts are the 5' (upstream) flanking region, the 5'
transcribed but untranslated region, the coding region (nonsynonymous
substitutions only, with both the slowest and fastest rates shown), intervening
sequences, the 3' (downstream) transcribed but untranslated region, and the
3' untranscribed region. On the average, the fastest rates of substitution occur
in intervening sequences and the 3' flanking regions, but the average rates in
the 5' flanking regions and the 3' untranslated region are all substantially
faster than 0.88 × 10⁻⁹, which is the average rate of nonsynonymous substitution
in coding regions (see Figure 8.14). Neutralists would argue that the high
rates of substitution in noncoding regions and variation among different parts
of the coding region result from varying degrees of selective constraints on
different parts of the gene. It is to be emphasized that Figure 8.16 depicts the
results for just one gene, and in individual instances, especially in comparisons
of closely related species, there may be fewer substitutions observed in
flanking sequences than in coding sequences, or fewer changes in synonymous
sites than in nonsynonymous sites.

Figure 8.15 Polymorphic nucleotide sites among 11 alleles of the Adh alcohol
dehydrogenase gene of D. melanogaster. The first line gives a consensus
sequence for Adh at sites that vary; subsequent lines give the nucleotides from
each copy for the polymorphic sites. A dot indicates that the site is identical to
the consensus sequence. The triangles indicate sites of insertion or deletion
relative to the consensus sequence. The star in exon 4 indicates the site of the amino
acid replacement (threonine-to-lysine) responsible for the Fast-Slow mobility
difference in the Adh protein. Sites are grouped by region, from the 5' flanking
sequence through the exons and introns to the 3' untranslated region and the
3' flanking sequence. (After Kreitman 1983.)

Figure 8.16 Nucleotide diversity in Adh of Drosophila melanogaster.

Comparison of nucleotide diversity in different functional regions of a 
single gene can reveal features of the gene's evolutionary history. For example,
in 11 sequenced Adh genes of Drosophila melanogaster (Kreitman 1983),
among 14 substitutions that were observed in the coding region, 13 were
silent substitutions. Considering the genetic code and the codon usage in the
Adh gene, it is possible to calculate what portion of the substitutions would
be silent if all substitutions occurred with equal frequency. This figure is
about 30% in the case of the Adh gene in Drosophila, which implies that about
70% of the substitutions would be expected to cause amino acid replacements.
Since only one out of 14 observed substitutions was an amino acid
replacement, such substitutions are greatly underrepresented. This finding is 
consistent with the view that most amino acid replacements are eliminated 
from the population by purifying selection. The same logic can be extended
to argue that sequences that are conserved are likely to be functionally
important; this type of reasoning led to the identification of a new open reading
frame in the HIV (AIDS) virus genome (Miller 1988).
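
How strongly underrepresented is a single replacement among 14 substitutions when about 70% are expected to be replacements? One rough way to gauge this, offered here only as a sketch, is a binomial tail probability. It assumes that the 14 substitutions are independent and that each has the same 0.70 chance of being a replacement, which is a simplification; the function name is illustrative.

```python
from math import comb

def binomial_tail_at_most(k, n, p):
    """Probability of observing k or fewer successes in n independent
    trials, each with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# One or fewer replacements among 14 substitutions, expected share 0.70.
p_value = binomial_tail_at_most(1, 14, 0.70)
print(f"{p_value:.2e}")
```

Under these assumptions the probability is on the order of one in a million, which is consistent with the conclusion that replacement substitutions are greatly underrepresented.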

The action of natural selection can sometimes be inferred from levels of 
synonymous and nonsynonymous polymorphism. For genes that determine
surface antigens of pathogens or those that determine the major histocom- 
patibility antigens of mammalian cells, the rates of nucleotide substitution 
can be quite high. One way to address whether the high rate of substitution 
is driven by selection is to examine the levels of synonymous and nonsyn- 
onymous diversity in these genes. For example, Hughes and Nei (1988) 
found that in the regions coding for the antigen recognition sites in the 
class I MHC (major histocompatibility complex) genes of humans and mice,
the rate of nonsynonymous substitution exceeded the rate of synonymous 
substitution by a ratio of 3 : 1. This ratio is the reverse of that found in the 
usual situation and in other regions in the same genes, where silent substitu- 
tions are present in excess. The excess of amino acid replacements is consis- 
tent with a model in which mutations that generate diversity are often 
advantageous, and hence natural selection accelerates the substitution 
process. Endo et al. (1996) developed software to scan the gene sequence 
databases for cases in which the nonsynonymous rate significantly exceeded 
the synonymous rate, and they recovered 17 cases. Nine of these 17 cases 
were cell surface antigens or immune system genes — proteins for which one 
can easily imagine scenarios in which high levels of diversity are advanta- 
geous. High rates of nonsynonymous substitution are also found in protein 
toxins called colicins that certain bacteria produce to kill potential competitors
in their immediate vicinity (Riley 1993; Ayala et al. 1994).

Implications of Codon Bias 

Synonymous substitutions occur at a greater rate than nonsynonymous sub- 
stitutions, implying that they face weaker selective constraints. But are syn- 
onymous changes completely neutral, or do they too face some form of con- 
straint? One potential type of constraint occurs through codon preferences, 
which are correlated with the relative abundance of tRNA molecules that 
interact with and translate the codons. In bacteria and yeast, for example, 
highly abundant proteins tend to use codons for abundant tRNA molecules, 
whereas proteins produced in small amounts tend toward codons for less 
abundant tRNA molecules (Ikemura 1985). A plot of the frequency of use of
the synonymous codons that code for leucine shows that CUG is much more 
frequent than the others, corresponding to an increased abundance of this 
tRNA. A second potential constraint on synonymous substitutions occurs 
through possible secondary structures that the RNA might form, in which 
certain nucleotides must undergo base pairing (see the next section for an 
elaboration). Pre-messenger RNA secondary structure may influence the 
speed or accuracy of intron splicing, rate of transport, or stability. A third
potential constraint on synonymous substitutions is related to the fact that,
during translation, the probability of misincorporation of the wrong amino
acid increases if there is a pause while the translation machinery waits to
find a rare tRNA. Such translation errors are known to occur (in fact,
mistranslation of an mRNA that bears a frameshift mutation can yield an active
protein). Pausing during translation may also be of importance to the folding
of the protein into its proper three-dimensional structure.

If synonymous codons are neutral, then one would expect their frequen- 
cies of use to correspond to the product of the nucleotide frequencies. If all 
four bases were equally frequent, all synonymous codons should be used 
equally frequently. A more subtle way to test for departure from equal codon 
use is to count the incidence of polymorphisms and substitutions toward or 
away from the most abundant codon. If the most abundant codon became 
the most abundant by chance, then the substitutions toward and away from 
this codon should show no bias. But if the most abundant codon is "preferred,"
then there will be a deficit of substitutions away from this codon.
Application of this kind of approach for codon bias in E. coli suggested an
average selection coefficient against disfavored codons of about s = 7.3 × 10⁻⁹
(Hartl et al. 1994). Even Drosophila, whose effective size might be around 10⁶,
appears to exhibit significant codon preference (Akashi 1995), suggesting
that selective constraints on synonymous codons must be greater than 10⁻⁶ in
this organism (Figure 8.17). 
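
The neutral expectation stated at the start of this paragraph, that synonymous codon usage should follow the product of the nucleotide frequencies, can be sketched in Python as follows. The function name is illustrative, and the base frequencies are placeholders chosen to show the equal-frequency case; they are not estimates from any genome.

```python
from math import prod

def expected_codon_shares(codons, base_freq):
    """Expected share of each synonymous codon if usage simply reflects
    the product of its nucleotide frequencies (normalized within the set)."""
    weights = {codon: prod(base_freq[b] for b in codon) for codon in codons}
    total = sum(weights.values())
    return {codon: w / total for codon, w in weights.items()}

leucine = ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"]
uniform = {base: 0.25 for base in "TCAG"}   # all four bases equally frequent
shares = expected_codon_shares(leucine, uniform)
print(shares)
```

With equal base frequencies every leucine codon has an expected share of one-sixth; observed usage that departs strongly from such expectations is what is meant by codon bias.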

From the selectionist viewpoint, while granting that substitutions in 
pseudogenes may be neutral, and synonymous substitutions may be con- 
strained by natural selection only weakly, it is nevertheless maintained that 
nucleotide substitutions that change amino acid sequences are inevitably 
subject to the action of natural selection of an intensity that is sufficient to 
counteract the effects of random genetic drift. Thus, selectionists would argue 
that amino acid substitutions that have occurred in a protein during the 
course of evolution became fixed by natural selection because they increased 
the fitness of the carriers through improvement in function of the molecule. 
However, neutralists argue back, the selectionist viewpoint cannot easily 
explain the negative correlation between functional importance and rate of 
substitution within proteins. Furthermore, a neutralist might add, even a
slightly detrimental mutation has some chance of being fixed unless a popu- 
lation is very large (Chapter 7). 


The effects of varying the neutral mutation rate on levels of polymorphism 
within a species and the interspecific divergence in nucleotide sequences are 
plotted in Figure 8.18. The theory is consistent with the idea that genes with 
a high rate of nucleotide substitution, as indicated by a large number of 

350 Chapter 8 


4341 1566 1417 777 705 351 Observed 
1767 1241 179.3 1259 1535 107R Expected 

Figure 8.17 The frequency of the six codons that encode leucine in
Drosophila melanogaster is not uniform. This kind of codon bias, in which one
codon is present in excess, is commonly observed. (Data from FlyBase, 

interspecific sequence differences, should also have a high level of intraspecific
polymorphism. Polymorphism depends only on the product of the neutral
mutation rate and the effective size, through the formula H = θ/(1 + θ)
that we encountered in Chapter 7. For strictly neutral genes, interspecific
divergence does not depend on the population size, but instead follows the
formula k = 2μt. If we compare two genes, the level of intraspecific polymorphism
would let us estimate a θ value for each gene. Given the value for
gene A and the observed interspecific divergence, the divergence
time could be estimated. For gene B, we would also have a θ estimated
from the level of polymorphism, and we could use the divergence
time estimated from gene A to determine a predicted value of divergence 
in gene B. 
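
The chain of reasoning in this paragraph can be sketched numerically. The sketch assumes the relations H = θ/(1 + θ) with θ = 4Nμ for polymorphism and a divergence of 2μt for strictly neutral genes; under those assumptions the ratio of divergence to θ equals t/(2N) for every gene, so gene A fixes a scaled divergence time that can be applied to gene B. The function names and the input values are hypothetical.

```python
def theta_from_heterozygosity(h):
    """Invert H = theta / (1 + theta) to obtain theta from H."""
    return h / (1.0 - h)

def predicted_divergence(h_a, k_a, h_b):
    """Predict the divergence in gene B from polymorphism in both genes
    and the observed divergence in gene A."""
    theta_a = theta_from_heterozygosity(h_a)
    theta_b = theta_from_heterozygosity(h_b)
    # k = 2*mu*t and theta = 4*N*mu imply k / theta = t / (2N).
    scaled_time = k_a / theta_a
    return theta_b * scaled_time

# Hypothetical values: heterozygosity 0.2 and divergence 0.5 in gene A,
# heterozygosity 0.5 in gene B.
k_b = predicted_divergence(h_a=0.2, k_a=0.5, h_b=0.5)
print(round(k_b, 3))
```

If the observed divergence in gene B departs markedly from this prediction, the two genes are not behaving as the neutral model expects, which is the comparison that the test described next makes formal.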

Figure 8.18 Reasoning behind the HKA test. Consider two genes, A and B,
that differ in neutral substitution rate. θ can be estimated for each gene based on
observed levels of nucleotide heterozygosity (top panel). Given the observed
divergence between two species in gene A (determined by the neutral mutation
rate and time), the divergence in gene B can be predicted based on its neutral
substitution rate, and the divergence time obtained from gene A. The HKA test
is a goodness-of-fit test to the observed levels of intraspecific diversity and
interspecific divergence under a model whose parameters are population sizes,
neutral mutation rates, and times of divergence. The horizontal axis is θ = 4Nμ.

The above reasoning has been formalized in a popular test of neutrality
based on nucleotide sequence data within and among species (Hudson et al.
1987). Sequences of at least two genes from a number of individuals of each
of two species are needed to apply the test. Define $S_i^A$ and $S_i^B$ as the number
of polymorphic nucleotide sites in gene i in species A and B, respectively, and
$d_i$ as the number of differences in gene i between a pair of alleles sampled
randomly, one from species A and one from species B. The expected values of
these parameters are obtained from the infinite-sites neutral model, assuming
that the two species diverged t generations ago, that the population sizes
are 2N and 2Nf, and that each gene has an associated $\theta_i = 4N\mu_i$. Estimates of
$\theta_i$, t, and f are obtained by a least-squares method that gives the best fit of the
expressions for the expected values and variances of $S_i^A$, $S_i^B$, and $d_i$ to the data,
and goodness-of-fit is tested with an appropriate chi-square test. Using data
from the Adh coding and 5' flanking regions in D. melanogaster and D. sechellia,
Hudson et al. (1987) found that the observed values deviated significantly
from the neutral model in a direction consistent with the operation of
balancing selection acting on the coding region of Adh. This finding is consistent
with Kreitman's (1983) observation of an excess of silent substitutions in
Adh, except that the test of Hudson et al. makes use of the genetic variation
observed within and among species. The "HKA" test has seen many applications
in molecular population genetics (Kreitman and Hudson 1991;
Aguadé et al. 1992; Begun and Aquadro 1993; Gaut and Clegg 1993).

PROBLEM 8.7 In a set of 12 Adh sequences in Drosophila melanogaster,
McDonald and Kreitman (1991) observed 42 silent (synonymous)
polymorphisms and two replacement (nonsynonymous)
polymorphisms. As had been concluded by Kreitman (1983), this 
suggests that most replacement mutations are deleterious and are 
eliminated from the population. When they examined fixed differ- 
ences between melanogaster and either D. simulans or D. yakuba, they 
found that seven of the fixed differences were replacements and sev- 
enteen were silent. What is the significance of this observation? 

ANSWER A null hypothesis might be that the effects on fitness of a 
mutation would be the same whether within a species or at any time 
along the ancestral history of two species back to the common ances- 
tor. If this is true, then we would expect the ratio of silent to replace- 
ment polymorphisms to be the same as the ratio of silent to 
replacement fixed differences. A simple test of this is to do a 2 x 2 con- 
tingency chi-square: 

                Fixed    Polymorphic
Replacement       7           2
Silent           17          42

For this table we get χ² = 8.20, and with one degree of freedom,
P < 0.01. (A correction is often applied to the chi-square for tables with 
counts less than 5, but it does not make much difference in this case.) 
The low probability means we reject the null hypothesis and conclude 
that, within species, there is a tendency to avoid replacement poly- 
morphisms; however, between-species replacement differences are 
much more likely to occur. McDonald and Kreitman (1991) argue that 
this pattern is consistent with adaptive fixation of amino acid replace- 
ments, since they are relatively more frequent in interspecific compar- 
isons, and such adaptive polymorphisms would be less common than 
neutral polymorphisms because adaptive differences would not 
remain polymorphic for as long a duration. This simple test is useful 
in assessing the relative importance of neutral drift versus selection in 
interspecific differences. 
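
The chi-square value in the answer to Problem 8.7 can be checked with a few lines of Python. The sketch below is only an illustration: the function name chi_square_2x2 and its yates option are hypothetical, the counts are those given in the problem, and the P value relies on the fact that a chi-square variable with one degree of freedom is the square of a standard normal deviate.

```python
import math

def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square test of independence for the 2 x 2 table [[a, b], [c, d]].

    Returns (chi_square, p_value) with one degree of freedom.
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Optional continuity correction for tables with small counts.
        diff = max(diff - n / 2, 0.0)
    chi2 = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 df, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Rows: replacement, silent. Columns: fixed, polymorphic.
chi2, p = chi_square_2x2(7, 2, 17, 42)
print(f"chi-square = {chi2:.2f}, P = {p:.4f}")
```

Calling the same function with yates=True applies the continuity correction mentioned in the answer.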

Impact of Local Recombination Rates 

Recall from Chapter 5 that the level of polymorphism in Drosophila shows a 
striking correlation to the local rate of recombination. Regions of low recom- 
bination rate are nearly devoid of variation, whereas regions with high rates 
of recombination are highly polymorphic. The idea of comparing polymor- 
phism and divergence makes this pattern even more striking and allows us 
to eliminate a possible cause. One possible reason for the correlation is that 
recombination itself is mutagenic, or that somehow the two processes are 
related mechanistically. (That is, perhaps when mutations occur, the DNA
configuration is altered to increase the chance of recombination.) If this were
the case, then the regions of low recombination rate should also have a low
mutation rate, and hence lower interspecific divergence. Figure 8.19 shows
that a lower divergence is not observed. Levels of interspecific divergence are
independent of local recombination rates. The conclusion is that the correlation
between recombination rates and levels of polymorphism observed by
Aquadro et al. (1994) must be due to more rapid elimination of the variation 
in regions of low recombination. 

Two known mechanisms that remove variation faster in regions of low 
recombination are selective sweeps and background selection. Background
selection is thought to be the primary mechanism for the reduced variation 
(discussed in Chapter 5 in the section on linkage and recombination), but this 
does not mean that sweeps do not occur. Selective sweeps occur when a 
favorable mutation takes place, and selection rapidly increases its frequency. 
Such sweeps can have a dramatic effect on levels of variation in the selected 

[Figure 8.19: plot not reproduced; horizontal axis: Coefficient of exchange]

Figure 8.19 The striking correlation between local rates of recombination and
levels of intraspecific nucleotide diversity cannot be explained by a lower mutation
rate in regions of low recombination. If regions of low recombination had
low rates of mutation, the interspecific divergence would be lower in these
regions. That it is not is shown by these data. (From Aquadro et al. 1994.)

gene and the region around it. The size of the "swept" region depends on the 
rate of recombination and is larger for regions of low recombination. This 
means that the chance that a particular site has been swept free of variation is 
greater in regions of low recombination, assuming the density of selective 
sweeps is uniform across the genome. An example of a selective sweep is an 
esterase B allele in the mosquito that is associated with pesticide resistance 
(Figure 8.20). The resistant allele has apparently undergone a nearly global 
sweep, judging from the near monomorphism of the esterase B gene (Raymond
et al. 1991; ffrench-Constant et al. 1991). We do not know how frequent
such sweeps are, but one possible means of identifying them is to score many 
highly polymorphic markers in many populations and look for regions of 
reduced variation. Schlotterer et al. (1997) performed such a survey and 
found several cases of individual genes in single populations that were 
depauperate in variation, perhaps due to a local sweep event. 


There is an important distinction between the construction of trees from 
sequences of genes from different species and from sequences of alleles from 
a single species. The former yields a customary phylogenetic gene tree, 
while the latter produces what is called a gene genealogy. The relationships 
among species result from macroevolutionary processes, whereas allelic dif- 
ferences result from a number of microevolutionary processes, including 
aspects of genetic transmission. Once the nucleotide sequences of alleles are 



Figure 8.20 Restriction maps of the esterase B gene from global samples of the
mosquito Culex pipiens. Note the identity of a haplotype from Egypt through Texas.
This haplotype is associated with insecticide resistance, and probably underwent a
global sweep in the face of strong selection. (From Raymond et al. 1991.)

known, the different alleles can be treated like genes in different species in 
applying standard methods for inferring a phylogenetic tree. However, great 
care is needed in constructing gene genealogies, because recombination 
among the sequences results in a gross violation of the assumptions of most 
tree-building methods. Provided the rate of recombination is not too high,
localized blocks of sequence can be identified in which there appears to have 
been no recombination in the ancestral history of the sampled alleles. With 
this caveat, gene genealogies can be of great use in inferring the evolutionary 

[Figure 8.21: tree diagram not reproduced; scale: Number of nucleotide differences per site]

Figure 8.21 A phylogenetic tree for 11 Adh alleles of Drosophila melanogaster
based on 43 nucleotide differences. The scale is the number of nucleotide differences
per site. Ja: Japan; Af: Africa; Wa: Seattle, Washington; Fl: Southern Florida;
Fr: France. S and F refer to the slow and fast electrophoretic forms. (Data
from Kreitman 1983.)

history of a polymorphism. For example, they can reveal which of a group of 
alleles is older, or which alleles are more closely related to each other. Figure 
8.21 shows the gene genealogy from Kreitman's (1983) Adh sequence data,
and the higher diversity of the S allele clearly makes it appear to be older. 

Hypothesis Testing Using Trees 

Beyond the descriptive approach to showing relationships among alleles, 
gene genealogies can be used to test fundamental forces of population genetics,
including natural selection. For example, consider a phylogenetic tree
based purely on neutral variation. As illustrated in Figure 8.22A, when the
substitution rate is u, the expected time to coalescence to a common ancestor
for a randomly chosen pair of alleles is 4N generations (Chapter 2). Under a
model like Ohta's (1973), where many mutations are slightly deleterious, the 
tree is not changed very much because the alleles included in a sample are 

[Figure 8.22, panels (A) No selection and (B) Purifying selection: plots not reproduced]


Figure 8.22 Computer simulations of the infinite-allele model of molecular
evolution. (A) With strict neutrality, the expected time from mutation to fixation
of alleles that will go to fixation is 4N_e generations. (B) Purifying selection (in
this case with half of the mutations having a fitness of 0.5) results in less polymorphism
at any given time. (C) Stabilizing selection (overdominance
or frequency dependence) can retain alleles in a polymorphic state for
much longer times. Representative trees are plotted to the right of each panel.

[Figure 8.22, panel (C) Stabilizing selection: plot not reproduced]



the subset of mutations that occurred that were nearly neutral. On the other
hand, in the case of adaptive mutations, the rate of fixation would be much
faster than with neutrality, so that sites of adaptive mutation would have
shorter coalescence times than flanking neutral sites (Figure 8.22B). Finally,
with balancing selection (heterozygote advantage), polymorphisms would 
be maintained for a longer time than under the pure drift model (Figure 
8.22C). The number of statistical methods for inference of population genet- 
ic forces from gene genealogies is increasing rapidly, and there is ample 
opportunity for exciting progress in this area. 

PROBLEM 8.8 A study of variation in the gene encoding superoxide
dismutase in Drosophila melanogaster (Hudson et al. 1994b)
revealed 63 polymorphic sites in three slow alleles and 22 fast alleles 
(where fast and slow refer to the mobility of the protein product in an 
electrophoretic gel). An additional 16 slow alleles were separately 
scored, giving a total of 19 slow alleles that were found to be identical 
in nucleotide sequence. The fast allele broke into 10 distinct haplo- 
types, and the most common was FastA with nine copies. The partial 
table of pairwise counts of numbers of sites that differ between 
alleles is: 


How would you address the question of whether this sample is typi- 
cal of a sample from a neutral gene? 

ANSWER The aspect of the pattern of variation that is unusual is
that the fast alleles appear to be quite variable, whereas all 19 slow alleles
are identical. A gene genealogy of the fast alleles would look like a
typical neutral tree with roughly exponentially distributed branch
lengths, but the complete tree would then have 19 identical slow alleles
placed one substitution away from FastA. The suspicion is that
the slow allele must have arisen recently and is being pulled to high
frequency by selection. An observed increasing trend in the slow allele
frequency supported this conjecture. To make a formal test out of this
observation, Hudson et al. (1994b) used the coalescent procedure
described in Chapter 7 to generate simulated data sets with a sample
size of 25 and having 63 polymorphic sites. Among the 10,000 simulated
samples, they asked how often there is a set of 12 alleles that
differ by 0 or 1 substitutions. (The 9 FastA alleles and 3 slow alleles in
the original observed sample differ at just 1 site.) The answer was 81
of the 10,000 cases, giving a probability of 0.0081. The observed sample
is not a likely occurrence under neutrality.

It is instructive to note that these data were consistent with neutrality by 
the Fu and Li (1993) test, the Tajima (1989) test, and the HKA test (Hudson et
al. 1987), demonstrating that even strong departures from neutrality may be
missed by these standard tests. This problem illustrates a common principle 
in molecular population genetic analysis, which is that ad hoc approaches tai- 
lored to particular observations often are necessary. 

The topology of gene trees affords an opportunity for yet another test of
goodness of fit of data to the neutral theory. We saw in Chapter 7 that the
coalescent approach provides a description for the expected topology of a
gene tree under the infinite-sites model. In particular, the time
back to the next preceding coalescent event is exponentially distributed
with a parameter that depends on k, the current number of distinct alleles. A
test of Fu and Li (1993) makes use of the fact that the model predicts a relationship
between θ and the number of "external mutations." An external
mutation is a mutation that occurs on a branch of the gene genealogy that
terminates in an observed allele (an external or terminal branch). The
remarkable observation that Fu and Li made was that the expected number
of external mutations is θ, independent of sample size. The test is based on the
idea that selection will affect mutations on external branches more than it
will affect those on internal branches, and Fu and Li devised test statistics for
goodness of fit between observed and expected numbers of external mutations.
The test has some advantages over the Tajima test, but Simonsen et al.
(1995), after extensive simulations to test the power of various neutrality
tests, conclude that the Tajima test (see Problem 7.10) is generally the most
powerful against alternative hypotheses of selective sweeps, population
bottlenecks, or population subdivision (but see Problem 8.8).
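
Fu and Li's observation about external mutations can be checked by simulation. The sketch below is not their test. It assumes the standard neutral coalescent, in which the waiting time while k lineages remain is exponential with rate k(k − 1)/2 when time is measured in units of 2N generations, and in which mutations fall on the genealogy at rate θ/2 per unit of time. The function names and the default number of replicates are hypothetical.

```python
import random

def external_branch_length(n, rng):
    """Simulate one coalescent genealogy for n sampled alleles and return the
    total length of its external (terminal) branches, in units of 2N generations."""
    # Each lineage is recorded as the number of sampled alleles it is ancestral to.
    lineages = [1] * n
    external = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        # Waiting time to the next coalescence (assumed rate: k(k - 1)/2).
        t = rng.expovariate(k * (k - 1) / 2)
        # A lineage ancestral to exactly one sampled allele is an external branch.
        external += t * sum(1 for size in lineages if size == 1)
        i, j = rng.sample(range(k), 2)
        merged = lineages[i] + lineages[j]
        lineages = [s for idx, s in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
    return external

def mean_external_mutations(n, theta, reps=5000, seed=1):
    """Estimate the expected number of external mutations for sample size n."""
    rng = random.Random(seed)
    mean_length = sum(external_branch_length(n, rng) for _ in range(reps)) / reps
    # Expected mutations = (theta / 2) x expected external branch length.
    return theta / 2 * mean_length

for n in (2, 5, 20):
    print(n, round(mean_external_mutations(n, theta=1.0), 2))
```

The estimates should stay close to θ whatever the sample size. The sketch averages branch lengths rather than placing individual mutations, which is enough for checking the expectation.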

Inferences about Migration Based on Gene Trees 

Data from a panmictic population obeying the infinite-sites model will have a
characteristic gene tree topology. If the population is divided into two semi-isolated
groups, the alleles within each group will, on average, be more similar
to one another than to alleles from the other group. This would mean that
a gene tree in such a subdivided population would have two major clades
corresponding to the two populations. For higher levels of migration, the
gene tree will be somewhere between these two extremes. Slatkin and 
Maddison (1989, 1991) devised means for estimating the number of migrants 
per generation, Nm, from the inferred gene genealogy. In essence, the 
approach uses parsimony to obtain a direct count of migration events com- 
patible with the tree, and Nm is estimated from this count. 
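
The counting step can be sketched with Fitch parsimony, a standard algorithm for the minimum number of state changes on a tree. This is an illustration only, not the full procedure of Slatkin and Maddison: the nested-tuple tree format, the function name, and the toy alleles are assumptions, and the conversion of the count into an estimate of Nm is not shown.

```python
def min_migration_events(tree, location):
    """Fitch parsimony on a rooted binary tree given as nested 2-tuples.

    Leaves are allele names; `location` maps each allele to the population
    in which it was sampled. Returns (possible_root_states, minimum_changes).
    """
    if not isinstance(tree, tuple):
        return {location[tree]}, 0
    left_states, left_changes = min_migration_events(tree[0], location)
    right_states, right_changes = min_migration_events(tree[1], location)
    shared = left_states & right_states
    if shared:
        # The two subtrees can agree on a location: no extra event is needed.
        return shared, left_changes + right_changes
    # Otherwise at least one migration event separates the two subtrees.
    return left_states | right_states, left_changes + right_changes + 1

# Hypothetical sample: three alleles from population A and three from B.
location = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B", "b3": "B"}

subdivided = ((("a1", "a2"), "a3"), (("b1", "b2"), "b3"))
intermixed = ((("a1", "b1"), "a2"), (("b2", "a3"), "b3"))

print(min_migration_events(subdivided, location)[1])
print(min_migration_events(intermixed, location)[1])
```

A tree with two clean clades needs a single event, whereas a tree in which alleles from the two populations are interleaved needs more.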

With sufficient DNA sequence data, one can be confident that identical 
alleles are truly identical by descent, an important aspect of inference of pop- 
ulation history and migration. In their analysis of D. pseudoobscura Adh 
sequences, Schaeffer and Miller (1991) found that geographically distant pop- 
ulations had identical alleles, and more generally, that the gene tree did not 
partition geographically, as though the population were panmictic. This was 
an exciting result, given the extraordinary level of population subdivision in 
D. pseudoobscura third chromosome inversions. It implies that the latter sub- 
division is not just a historical accident, but is being maintained in the face of 
sufficient migration to homogenize other sorts of genetic variation. 

The data of Bowcock et al. (1994) show yet another aspect of very high 
resolution molecular data. After constructing a tree based on 30 microsatellite 
loci, they observed that human samples showed a significant tendency to 
cluster by continent. Although lower-resolution methods had shown some
degree of dissimilarity among groups of humans, this was the first study to
show that reduced intercontinental migration was sufficient to partition
human genetic variation.


We already saw in Chapter 5 that mitochondrial DNA can be highly informative
about the geographic structure of populations. Some of the advantages
of using DNA sequence variation from this organelle genome include:

• The DNA molecule (in most animals) is relatively small and easy to isolate. 

• It is present in multiple copies per cell; therefore, older and less well pre- 
served samples are still likely to yield useful information. 

• The mitochondrial genome does not undergo recombination, so it is more 
likely to show a clean branching structure to its gene trees. 

• It evolves rapidly. 

The primary problems with mtDNA are:

• The absence of recombination means that the gene tree constructed from 
any mitochondrial DNA gene will reflect just a single realization of the 
genealogical process. As such, the data will not be as informative about 
species or population trees as, say, a dozen nuclear genes. 

• Much of the work with mtDNA has been on the control region sequence. 
While this region is highly variable, the variability occurs at a subset of 
sites that are so mutable that multiple substitutions often occur. 

In animals, mitochondria are usually inherited through the egg cytoplasm
(maternal inheritance) and are genetically uniform within an individual. The
mitochondrial genome consists of a single circular DNA molecule, denoted
mtDNA, the size of which varies over a remarkably narrow range in different 
species of vertebrates (15.7-19.5 kb), averaging about 16 kb. Human mtDNA 
is fairly typical, containing a control region for the initiation of DNA replica- 
tion, genes for two ribosomal RNA molecules, 22 transfer RNA molecules, 
and 13 proteins. Twelve of the proteins are subunits of enzyme complexes 
that carry out electron transport and ATP synthesis. The genetic code of 
mammalian mitochondria differs from the standard code in that ATA codes 
for Met, TGA codes for Trp, and AGR codes for End (termination of protein 
synthesis); thus, every codon in the mitochondrial code can be written as 
either NNY or NNR. Animal mitochondria also contain several hundred 
enzymes used in metabolic functions, but these are coded for by nuclear 
genes, and the enzymes are transported into the mitochondria. 
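
The statement that every codon in the mammalian mitochondrial code can be written as NNY or NNR can be verified directly. The sketch below builds the standard code from its usual one-letter table (bases in the order T, C, A, G, with * for End), applies the three changes named in the text, and lists any pyrimidine-ending or purine-ending codon pair whose two members differ in meaning. The variable and function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
STANDARD = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
            "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")

codons = ["".join(c) for c in product(BASES, repeat=3)]
standard_code = dict(zip(codons, STANDARD))

# Changes described in the text: ATA = Met, TGA = Trp, AGR = End.
mito_code = dict(standard_code)
mito_code.update({"ATA": "M", "TGA": "W", "AGA": "*", "AGG": "*"})

def split_pairs(code):
    """Return the NNY or NNR codon pairs whose two members differ in meaning."""
    bad = []
    for first, second in product(BASES, repeat=2):
        for pair in ("TC", "AG"):  # pyrimidines (Y), then purines (R)
            a, b = (first + second + third for third in pair)
            if code[a] != code[b]:
                bad.append((a, b))
    return bad

print("standard code:", split_pairs(standard_code))
print("mitochondrial code:", split_pairs(mito_code))
```

In the standard code only the pairs ATA/ATG and TGA/TGG are split, and these are exactly the pairs that the mitochondrial changes make uniform.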

At the nucleotide level, the rates of substitution in mammalian mtDNA
are typically 5 to 10 times greater than those in single-copy nuclear genes,
averaging approximately 10 × 10⁻⁹ substitutions per nucleotide site per year.
The reason for the high rate of substitution is thought to be either a high rate
of nucleotide misincorporation or a low efficiency of repair of the DNA polymerase.
Support for the latter view comes from the observation that, unlike

[Figure 8.23: plot not reproduced; horizontal axis: Divergence time (million years)]

Figure 8.23 Relationship between percent sequence divergence (100d) and
divergence time. The points represent estimates from pairwise comparisons of
restriction endonuclease cleavage maps. The initial rate of mtDNA sequence
divergence is shown by the longer dashed line and the rate of divergence of
single-copy nuclear DNA by the shorter dashed line. (From Brown et al. 1979.)

the nuclear DNA polymerase, the mitochondrial DNA polymerase lacks the
proofreading function. In protein-coding mitochondrial genes, the rate of
synonymous substitution is about five times greater than the rate of nonsynonymous
substitution, which is comparable with the ratio found in nuclear
genes. Mitochondrial tRNA genes in mammals evolve approximately 100
times as rapidly as their nuclear counterparts (Brown 1985; Avise 1986). One
result of this faster rate of nucleotide substitution is that the divergence
between two sequences saturates relatively soon, so that the linearity of
divergence over time (the molecular clock) is an accurate approximation only
for species that have diverged less than about 10 million years ago (Figure 8.23).
Exceptions to the elevated rate of mtDNA divergence have been found,
notably in Drosophila (Powell et al. 1986).

PROBLEM 8.9 The mitochondrial DNA of 21 humans of diverse 
geographic and racial origin were digested with 18 restriction 
enzymes, 11 of which exhibited one or more fragments in which size 
polymorphism occurred (Brown 1980). All restriction site polymor- 
phisms could be explained by single-nucleotide differences, thus 
there was no evidence for insertions, deletions, or other mtDNA 
rearrangements. Altogether, 868 nucleotide sites were assayed for dif- 
ferences among individuals, and the average number of differences 
per nucleotide site per individual was estimated at 0.0018. Assuming 
that mammalian DNA undergoes sequence divergence at the rate of 5 to
10 × 10⁻⁹ nucleotide substitutions per site per year, and that the rate is
uniform in time, calculate the length of time since all of the 21 contem- 
porary mtDNA molecules last shared a common ancestor. Calculate the 
effective size of the population from the level of mtDNA variability. 

ANSWER Given an average number of differences per nucleotide site
per individual of 0.0018 and an average rate of divergence of 5 to 10 ×
10⁻⁹ per site per year, the time of the most recent common ancestor
would be between 0.0018/(10 × 10⁻⁹) and 0.0018/(5 × 10⁻⁹), or 180,000 to
360,000 years. Assuming a generation time of 20 years, this means that
all mtDNA in the diverse sample could have been from a single female
in the population between 9,000 and 18,000 generations ago. To estimate
the long-term effective size of the population, recall that the expected
time to fixation of a newly arisen neutral mutation is 4N_e generations.
This result applies to an autosomal gene in a diploid species. For mitochondrial
genes, only females transmit them, and they are effectively
haploid, so the corresponding fixation time for mtDNA is just N_e generations.
If we argue that the one mtDNA type went to fixation in 9,000 to
18,000 generations, this is equivalent to saying that the long-term population
size has been N_e = 9,000 to 18,000. This sounds like a low number,
but modern anthropologists find it reasonable, given the
population structure of ancient humans and the rapid, nearly starburst-like
growth since the adoption of agricultural methods.
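
The arithmetic in this answer can be laid out as a short calculation. The sketch below simply restates the numbers given in Problem 8.9; the function and variable names are hypothetical, and the last step uses the answer's own argument that the fixation time for mtDNA is about N_e generations.

```python
def years_to_common_ancestor(differences_per_site, rate_per_site_per_year):
    """Divergence divided by the rate of divergence, as in the answer above."""
    return differences_per_site / rate_per_site_per_year

differences_per_site = 0.0018
generation_time_years = 20  # assumption stated in the answer

for rate in (10e-9, 5e-9):
    years = years_to_common_ancestor(differences_per_site, rate)
    generations = years / generation_time_years
    # For mtDNA the answer equates this number of generations with N_e.
    print(f"rate {rate:.0e}: {years:,.0f} years, "
          f"{generations:,.0f} generations, N_e of about {generations:,.0f}")
```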

One of the most dramatic claims in the history of population genetics was
that human genetic variation in mtDNA indicates a recent African origin of
modern humans (Cann et al. 1987). This claim was based on restriction site
variation among mtDNA of 147 humans in five populations. The 12 restriction
enzymes sampled an average of 370 restriction sites per individual,
equivalent to assaying 9% of the mtDNA genome per individual. A total of
195 polymorphic sites were found in the genome, and the precise location on
the mtDNA sequence of all polymorphic sites was identified. When the 133
distinct mtDNA haplotypes were assembled into a phylogenetic tree, a clade
was found in which the most ancient branch pointed to a group of people of
African ancestry (Figure 8.24). Given the observed number of differences
between the two most divergent mtDNA types, and assuming there is 2 to
4% divergence in mtDNA sequences per million years (estimated from the
human-chimp split at 5 MYA), the common ancestor to all of the observed
haplotypes was estimated to have existed 140,000 to 280,000 years ago.

[Figure 8.24: tree diagram not reproduced; horizontal axes: Sequence divergence %]

Figure 8.24 Parsimony tree of mtDNA variation from the original "mitochondrial
Eve" paper. Much was made of the observation that there is an isolated
clade consisting only of Africans. (From Cann et al. 1987.)

Sequences of the control region, which diverge at a rate of 12 to 15% per million
years, produce a date for the common ancestor of 166,000 to 249,000
years ago (Vigilant et al. 1991). Several other data sets have been collected to
address the issue of date, and all have produced estimates of the date of the
common ancestor of human mtDNA of between 100,000 and 400,000 years
ago (Hasegawa and Horai 1991; Pesole et al. 1992; Ruvolo et al. 1993). These
figures, and their interpretation, have launched a controversy centering on
(1) the best way to infer the time of the common ancestor, (2) the meaning of
higher African diversity, (3) the confidence in an African root, (4) the neutrality
of human mtDNA variation, and (5) the implications for human evolution.
Whether modern humans migrated out of Africa in the past 200,000
years may not be supported with statistical rigor by mtDNA alone (Templeton
1993), but when haplotypes of nuclear genes (Tishkoff et al. 1996) or
many nuclear genes (Nei and Roychoudhury 1993) are considered in addition,
the case for African origin is strong.

We must be careful to realize, however, that the fact that Africa has the 
greatest genetic diversity today does not by itself guarantee that modern 
humans originated in Africa. If the African population has had a long-term 
effective size much larger than other populations, or if the other populations 
suffered a bottleneck that Africa did not, then Africa would be more diverse 
no matter where humans originated (Relethford 1995). In addition, just 
because a gene genealogy appears to have a root that coincides with an 
African allele does not mean that modern humans came from an expansion of 
the African population to cover the earth; it only means that the one gene has
the observed ancestral history. Other genes may trace back to other origins. 

Inferences about human origins from extant patterns of genetic variation 
require an understanding of nonequilibrium models, where populations 
grow in size, new colonies are founded, and populations remain connected
by some level of migration. Recently there has been much attention paid
to the influence of past changes in population size on patterns of variation. It 
was observed that a growing population produces a gene genealogy that has 
a more starlike shape than does a stationary population, and this in turn pro- 
duces a peak in the distribution of pairwise counts of mismatches (Slatkin 
and Hudson 1991; Rogers and Harpending 1992). The use of patterns of 
human genetic variation to make inferences about our ancestral history is an 
active and lively area of inquiry. 
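
The distribution of pairwise mismatches referred to above is simple to tabulate from aligned sequences. The sketch below is a minimal illustration; the function name is hypothetical and the four short sequences are invented toy data, not taken from any study.

```python
from collections import Counter
from itertools import combinations

def mismatch_distribution(sequences):
    """Count, for every pair of aligned sequences, the number of sites at
    which they differ, and return {number_of_differences: number_of_pairs}."""
    counts = Counter(
        sum(a != b for a, b in zip(s, t))
        for s, t in combinations(sequences, 2)
    )
    return dict(sorted(counts.items()))

# Toy data only: four aligned sequences of four sites each.
toy_sample = ["AAAA", "AAAT", "AATT", "TTTT"]
print(mismatch_distribution(toy_sample))
```

With real data, a single pronounced peak in this tabulation is the pattern that has been associated with a starlike genealogy.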

Chloroplast DNA and Organelle Transmission in Plants 

Chloroplasts are cellular organelles that also have their own genome and 
also are transmitted in a non-Mendelian fashion. Chloroplast DNA (cpDNA) 
ranges in size from 135 to 160 kb, and it occurs in multiple copies in each 
chloroplast. Its structural organization is conserved in higher plants, and the 
rate of synonymous nucleotide substitution is approximately 1 × 10⁻⁹ substitutions
per site per year. Thus, the evolution of cpDNA is conservative in

Table 8.3

                      Rate of nucleotide    Rate of structural
                      substitution          evolution
Angiosperm cpDNA      Slow                  Slow
Angiosperm mtDNA      Slow                  Fast
Mammalian mtDNA       Fast                  Slow
Fungal mtDNA          Fast                  Fast






regard to both sequence and structure (Table 8.3). The opposite extreme, with 
a very fast rate of evolution, is found in the mtDNA of fungi, which changes 
rapidly in both sequence and structure. 

The mtDNA of angiosperm plants has the opposite pattern of evolution from that
found in animal mtDNA. In sequence evolution the rate in angiosperms is
slow, but in structural evolution it is fast. In plants, the mtDNA genome is
large and highly complex. In some instances, a single molecule can resolve
itself into smaller circles and even linear molecules. For example, in the
turnip (Brassica campestris), a 218 kb molecule undergoes an internal recombination
event that produces smaller circles of 135 kb and 83 kb. Maize
mtDNA contains six pairs of repeated sequences that can undergo recombination
and create a variety of structural derivatives. The Arabidopsis mtDNA
genome was recently sequenced, and although it is 366 kb, nearly all the
increase in size compared to mammalian mtDNA is noncoding (Unseld et al.
1996). Many plant mitochondria also contain autonomously replicating plas- 
mid DNA molecules, and mtDNA is also capable of incorporating segments 
of cpDNA. Why plant mtDNA genomes are so large, complex, and variable 
in size is not understood. 

Maintenance of Variation in Organelle Genomes 

Organelle genomes have unusual population genetics because of their (typ- 
ically) uniparental transmission and because many copies are passed from 
the mother to the progeny through the egg. Uniparental transmission has 
important implications in the operation of natural selection, since it is 
equivalent to a haploid clonal population structure, and pure selection mod- 
els can maintain polymorphism in such populations only if the fitnesses are 
frequency dependent. From the outset, then, uniparental transmission 
makes it less likely for polymorphisms to be maintained by natural selec- 
tion, even if epistatic effects with the nuclear genome are allowed (Clark 
1984). The widespread polymorphisms observed in mtDNA must then be
attributed largely to high mutation rates, just as the rapid substitution rate
was attributed to a high mutation rate. Polymorphisms can also be maintained
by interspecific hybridization, and it is possible to obtain estimates
of rates and directions of interspecific matings from nuclear and mtDNA
data (Asmussen et al. 1987). Unusual forms of transmission, such as the
doubly uniparental transmission of the mussel Mytilus edulis, result in separate
male and female lineages, which are highly divergent (Skibinski et al.
1994; Stewart et al. 1995).
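
The claim that constant fitnesses cannot maintain polymorphism under uniparental transmission can be illustrated with the standard recursion for two types in a haploid population, p' = p w1/(p w1 + q w2). The sketch below is a minimal illustration; the function names and the fitness values are assumptions chosen only for the example.

```python
def iterate_haploid(p, fitness, generations):
    """Iterate p' = p*w1 / (p*w1 + q*w2) for two cytoplasmic types.

    `fitness` takes the current frequency p of type 1 and returns (w1, w2).
    """
    for _ in range(generations):
        w1, w2 = fitness(p)
        p = p * w1 / (p * w1 + (1 - p) * w2)
    return p

def constant_fitness(p):
    # Type 1 has a fixed 2% advantage (hypothetical value).
    return 1.02, 1.0

def rare_type_advantage(p):
    # Hypothetical frequency dependence: each type is fitter when rare.
    return 1 - 0.5 * p, 1 - 0.5 * (1 - p)

print(iterate_haploid(0.1, constant_fitness, 2000))
print(iterate_haploid(0.1, rare_type_advantage, 500))
```

With constant fitnesses the favored type goes to fixation, whereas the frequency-dependent example settles at an intermediate frequency.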

The theory of random genetic drift for organelles is more complex than 
that for nuclear genes because individual cells have many organelles that are 
apportioned among daughter cells; thus there is an additional level of sam- 
pling when heteroplasmic cells divide. Models of the dual sampling process 
have been examined in some detail (Birky et al. 1983; Takahata 1983, 1984).
These models predict some level of heteroplasmy, and although early empirical
studies did not detect heteroplasmy, it has now been described in crickets
(Harrison et al. 1987), Drosophila (Hale and Singh 1986; Solignac et al.
1983, 1984, 1987), lizards (Densmore et al. 1985), mice (Boursot et al. 1987),
cattle (Hauswirth and Laipis 1982), frogs (Monnerot et al. 1984), treefrogs,
and bowfin fish (Bermingham et al. 1986). Heteroplasmy can be maintained
by a steady-state balance between the forces of random genetic drift and 
mutation, but heteroplasmy is most frequently observed in restriction length 
polymorphisms, in which variants differ in the number of copies of a small 
repeat. Simple deterministic models show that heteroplasmy can be stably 
maintained by infrequent paternal transmission (leakage), by natural selec- 
tion, or by bi-directional mutation, such as the gain/loss events one would 
expect for changes in copy number of a small repeat (Clark 1988). Distribu- 
tions of heteroplasmy in the field cricket are consistent with a model of 
mutation-selection balance, with smaller genomes favored by selection (Rand 
and Harrison 1986).

Evidence for Selection in mtDNA 

There are several clear examples of nonneutrality of mtDNA mutations. For
example, many forms of cytoplasmic male sterility are caused by defects in
mtDNA (Grun 1976; Levings 1983). Similarly, cytoplasmically transmitted
drug resistance genes have been shown to be associated with the mitochondrial
genome of yeast. The potential importance of mtDNA variation in
human health was revealed in the implication of mitochondrial DNA defects
in the muscle diseases known as mitochondrial myopathies. The celebrated
bicycle racer Greg LeMond, a three-time winner of the Tour de France, was
forced into early retirement by a defect in mitochondrial oxidative metabolism.
Effects of natural selection also have left their mark on extant patterns
of mtDNA sequence variation, as revealed by the discordance between
levels of polymorphism and divergence in synonymous vs. nonsynonymous
sites (Ballard and Kreitman 1994; Rand and Kann 1996). The strongly
skewed distribution of frequencies of segregating sites also suggests that
human mtDNA has faced selection pressure (Hey 1997).

If a cytoplasmically related factor of any sort is associated with a particular 
mtDNA type, then the mtDNA will "hitchhike" along with the other cytoplas- 
mic factor. A striking example of this mode of evolution in action was caught by 
Turelli et al. (1992), when they noticed that a cytoplasmically transmitted Wolbachia
infection in Drosophila simulans was rapidly spreading north in California,
and as it did so, it propelled a single mtDNA type to high frequency. While the 
mtDNA genome may seem small, its uniparental transmission makes it suscep- 
tible to any cytoplasmic factor that may carry a particular cytoplasmic type to 
fixation. However, most populations have fairly high levels of mtDNA variation, 
suggesting that such sweep events are not very common. 


The use of techniques of molecular biology, particularly those for
determining amino acid or nucleotide sequences, has added a new dimension to
phylogenetic inference. For example, the analysis of 5S RNA sequences in a
broad variety of microorganisms has led to a reclassification at the deepest
of phylogenetic levels, resulting in a new kingdom, the Archaea (Woese
1981). In addition to the satisfaction of understanding the history of
relationships among living things, the application of comparative molecular
analysis to infer robust and accurate phylogenetic relationships has spawned
interest in the application of those phylogenetic trees for testing hypotheses
about evolutionary mechanisms. The problem of inferring the correct
branching topology for a tree that relates a set of organisms is a challenge in
part because of the enormous number of possible bifurcating trees. If there
are n species to be placed, there are $(2n - 3)!/[2^{n-2}(n - 2)!]$ rooted
trees that describe possible ancestral histories. For five species this
number is 105, and for 10 species it is 34,459,425. For many data sets of 30
or more species, the number of possible trees is so enormous that it is not
possible to examine all topologies and assess the fit of the data to each
tree, even with the very fastest computers. Fortunately, the trees are not
all independent of one another, and the key to many of the algorithms that
try to find the best fitting tree is to eliminate whole classes of trees
based on the observed data. Let
us consider a few of these tree-building methods.
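
The growth in the number of rooted bifurcating trees is easy to check numerically. The following is a minimal sketch of the count (2n - 3)!/[2^(n-2) (n - 2)!]; the function name `num_rooted_trees` is ours and is used only for illustration.

```python
from math import factorial

def num_rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees relating n labeled species (n >= 2)."""
    if n < 2:
        raise ValueError("need at least two species")
    # (2n - 3)! / (2^(n-2) * (n - 2)!), which is always a whole number
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in (5, 10, 30):
    print(n, num_rooted_trees(n))
```

For n = 5 and n = 10 this reproduces the values 105 and 34,459,425 quoted above; the value printed for n = 30 shows why exhaustive search is out of the question.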

Algorithms for Phylogenetic Tree Reconstruction 

If a gene in a pair of species or populations evolves in clocklike fashion, and
if the degree of divergence between two genes implies that they have been
diverging for t generations, then we can infer that the genes separated from
a common ancestor t/2 generations ago. This reasoning provides a group of
methods of tree construction based on measures of genetic distance. One
such method is the unweighted pair-group method with arithmetic mean
(UPGMA) or average distance method. This method requires that all
sequences evolve at the same rate, an assumption that other methods can
relax to some degree, but the ease of understanding UPGMA still gives it
heuristic appeal. With a matrix of all pairwise distances, a tree is built up by
first grouping the two species with the smallest distance. A new distance
matrix is then constructed, with the grouped species now considered as one
unit. If the grouped species were indexed i and j, then for all k ≠ i, j the
distance from k to the group (i,j) is $d_{k(ij)} = \tfrac{1}{2}(d_{ik} + d_{jk})$.
In words, the distance from each other species k to the group (i,j) is the
average of the distances from species k to each of species i and j in the
group. The new distance matrix is
again searched for the smallest element, and the appropriate grouping again
occurs. This process is repeated until all species are clustered into a tree.
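
The clustering loop just described can be sketched in a few lines. This is an illustrative sketch, not a reference implementation: the function name `upgma`, the nested-tuple representation of the tree, and the `weight_by_size` switch are all our own choices. With `weight_by_size=False` the update is literally the simple average given above; the default weights each group by the number of species it contains, which is how the "unweighted" method is usually implemented so that every original species counts equally. The two rules coincide whenever both joined units are single species.

```python
def upgma(dist, names, weight_by_size=True):
    """Cluster species by repeatedly joining the two closest units.

    dist  : symmetric list-of-lists matrix of pairwise distances
    names : one label per row of dist
    Returns (tree, height): tree is a nested tuple, height is the depth of
    the root (half the distance at which the last two units were joined).
    """
    units = list(names)               # current units: labels or nested tuples
    sizes = [1] * len(names)          # number of species inside each unit
    d = [list(row) for row in dist]   # working copy of the distance matrix
    height = 0.0
    while len(units) > 1:
        n = len(units)
        # find the pair of units separated by the smallest distance
        i, j = min(((a, b) for a in range(n) for b in range(a + 1, n)),
                   key=lambda p: d[p[0]][p[1]])
        height = d[i][j] / 2
        wi, wj = (sizes[i], sizes[j]) if weight_by_size else (1, 1)
        keep = [k for k in range(n) if k not in (i, j)]
        # distance from every remaining unit k to the new group (i, j)
        to_new = [(wi * d[i][k] + wj * d[j][k]) / (wi + wj) for k in keep]
        d = [[d[a][b] for b in keep] + [to_new[r]] for r, a in enumerate(keep)]
        d.append(to_new + [0.0])
        units = [units[k] for k in keep] + [(units[i], units[j])]
        sizes = [sizes[k] for k in keep] + [sizes[i] + sizes[j]]
    return units[0], height

# Hypothetical clocklike distances among four species
example = [[0, 2, 6, 10],
           [2, 0, 6, 10],
           [6, 6, 0, 10],
           [10, 10, 10, 0]]
print(upgma(example, ["A", "B", "C", "D"]))
```

Because the example distances are perfectly clocklike, the joins occur in the order (A,B), then C, then D.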

Tree-building methods not only produce a tree topology, but they
generally also give estimates of branch lengths of the tree. An example of one
method for branch-length estimation is the method of Fitch and Margoliash
(1967). Suppose the number of substitutions distinguishing sequences i and j
is $d_{ij}$. If the tree relating sequences 1, 2, and 3 has branch lengths A, B,
and C (Figure 8.25), then the branch lengths can be estimated from

$$
\begin{aligned}
A &= \tfrac{1}{2}(d_{12} + d_{13} - d_{23}) \\
B &= \tfrac{1}{2}(d_{12} + d_{23} - d_{13}) \\
C &= \tfrac{1}{2}(d_{13} + d_{23} - d_{12})
\end{aligned}
\tag{8.17}
$$

These relations were found by solving the equations $d_{12} = A + B$,
$d_{13} = A + C$, and $d_{23} = B + C$. With more than three sequences, the tree
is built up by considering three units at a time, beginning with the two most
closely related sequences and grouping the remaining sequences. If sequences
1 and 2 are the most similar, then the distance from sequence 1 to the
remaining group is the average of the distances from sequence 1 to each
member of the group.




Figure 8.25 A simple phylogenetic tree relating Species 1, Species 2, and
Species 3. A, B, and C represent branch lengths from the most recent common
ancestor.

In this way, only three distances are considered at a time, and Equations 8.17
allow branch lengths to be estimated. This method is known as least squares,
and it turns out that Equations 8.17 minimize the sum of squared deviations
from the model, much like linear regression.
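
The three-sequence case can be written out directly. The sketch below solves d12 = A + B, d13 = A + C, and d23 = B + C for the three branch lengths; the function name `three_point_branch_lengths` and the distances in the example are hypothetical.

```python
def three_point_branch_lengths(d12, d13, d23):
    """Branch lengths A, B, C leading to sequences 1, 2, 3 of a three-taxon tree."""
    a = (d12 + d13 - d23) / 2
    b = (d12 + d23 - d13) / 2
    c = (d13 + d23 - d12) / 2
    return a, b, c

# Hypothetical distances among three sequences
a, b, c = three_point_branch_lengths(d12=3, d13=4, d23=5)
print(a, b, c)
# The estimates add back up to the observed pairwise distances
print(a + b, a + c, b + c)
```

With three sequences there are three equations and three unknowns, so the fit is exact; the least-squares character of the method matters only once groups of sequences are averaged.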

Another algorithm for tree construction is particularly well suited to the 
situation in which one does not know whether rates of substitution are con- 
stant across clades of the tree. This method is known as neighbor-joining, 
because it groups species that are "neighbors" (Saitou and Nei
1987). Begin by assuming that the sequences are all related to one another by
a star phylogeny (Figure 8.26). For a star phylogeny with N sequences, the
sum of the branch lengths is

$$S_0 = \frac{1}{N-1} \sum_{i<j} d_{ij}$$

(It may help to draw a star phylogeny to see that each branch gets counted 
N - 1 times.) Next we begin a procedure that groups certain sequences
together. For each possible pair of sequences, a tree like that in step 1 in
Figure 8.26 is constructed. Branch lengths for this tree are estimated by least
squares, and the sum of the branch lengths for the entire tree ($S_{ij}$) is
calculated. We consider as neighbors that pair of sequences i and j that gives
the minimum of the $S_{ij}$'s. After the first pair of neighbors is found, that
pair is considered as a single entity (joined neighbors), and the process of
considering all possible pairings is repeated. The distance from any one
sequence k to this pair of neighbors (i and j) is the average of the two
distances, or $\tfrac{1}{2}(d_{ik} + d_{jk})$. The process ends when there are
just three neighbors left, and at this point we have a finished
neighbor-joining tree complete with branch lengths. The criterion for
neighbor-joining is to minimize the sum of branch lengths, and sometimes it is
possible to find tree topologies that are even shorter, using a method called
minimum evolution (Rzhetsky and Nei 1992).
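
The joining procedure can be sketched as follows. This is a sketch under stated assumptions rather than a definitive implementation: the closed-form expression used for the tree length when i and j are paired is the one usually attributed to Saitou and Nei (1987) and is not given in the text, the distance update is the simple average described above, and the function names, the nested-tuple output, and the example distances are ours. Branch lengths are omitted to keep the example short.

```python
from itertools import combinations

def tree_length_if_joined(d, i, j):
    """Least-squares sum of branch lengths when i and j are paired and all
    other units remain in a star (assumed Saitou-Nei closed form)."""
    n = len(d)
    others = [k for k in range(n) if k not in (i, j)]
    to_pair = sum(d[i][k] + d[j][k] for k in others)
    among_rest = sum(d[k][m] for k, m in combinations(others, 2))
    return to_pair / (2 * (n - 2)) + d[i][j] / 2 + among_rest / (n - 2)

def neighbor_joining(dist, names):
    """Join neighbors until three units remain; return them as a tuple."""
    d = [list(row) for row in dist]
    units = list(names)
    while len(units) > 3:
        n = len(units)
        # the pair giving the shortest tree is declared a pair of neighbors
        i, j = min(combinations(range(n), 2),
                   key=lambda p: tree_length_if_joined(d, p[0], p[1]))
        keep = [k for k in range(n) if k not in (i, j)]
        # distance from each remaining unit to the joined pair: simple average
        to_new = [(d[i][k] + d[j][k]) / 2 for k in keep]
        d = [[d[a][b] for b in keep] + [to_new[r]] for r, a in enumerate(keep)]
        d.append(to_new + [0.0])
        units = [units[k] for k in keep] + [(units[i], units[j])]
    return tuple(units)

# Hypothetical distances in which rates differ among lineages
example = [[0, 5, 7, 6, 10],
           [5, 0, 8, 7, 11],
           [7, 8, 0, 7, 11],
           [6, 7, 7, 0, 6],
           [10, 11, 11, 6, 0]]
print(neighbor_joining(example, ["A", "B", "C", "D", "E"]))
```

In the example, the smallest entry of the matrix (A-B, 5) is not the first pair joined; the pair D-E gives the shorter total tree length, which illustrates how the criterion differs from simply grouping the closest pair.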

PROBLEM 8.10 Consider a sample of one allele drawn from each of
three species. Suppose that the tree that one gets from these alleles
may be represented ((A,B),C), implying that A and B are most closely
related, and C is the outgroup. What are the possible relationships
among the species bearing these alleles?

ANSWER This problem bears on an important issue in phylogeny
reconstruction, namely that any one gene tree does not necessarily
reflect the true pattern of splitting of species. The easiest way to see
this is to consider ancestral populations as being polymorphic, in
which case the speciation process may sort out the alleles in various
ways. It turns out that the possible species trees include ((A,B),C),
((A,C),B), and ((B,C),A). In other words, the gene tree does not
eliminate the possibility of any of the species trees.

Figure 8.26 Illustration of the neighbor-joining method for phylogeny
reconstruction. Given a distance matrix, one starts with a star phylogeny and
tests all trees having different pairs separated from the rest. The tree with
A-B joined is the shortest such tree. The process of testing all pairs of
"neighbors", where a neighbor may be either a single allele or a cluster of
alleles, is repeated until no more joining can be done. (See Saitou and Nei
1987.)

Distance Methods versus Parsimony 

No universal theory provides a single optimal way to construct
phylogenetic trees, and as basic as the distance matrix seems, it is not
required by all methods. Another method, known as maximum parsimony, 
uses the smallest number of mutational events necessary to account for the 
evolution of a set of sequences from a common ancestor to construct the 
trees. There are a number of such parsimony methods based on trees with 
the smallest number of substitutions, but none guarantee that the most par- 
simonious tree is the correct tree. For example, when rates of substitution dif- 
fer in different branches of the tree, the parsimony method often fails to give 
the correct topology (Felsenstein 1978). Methods for constructing phyloge- 
netic trees have been reviewed by Felsenstein (1981, 1982) and more recent- 
ly by Nei (1996). Massive simulation studies have been done to test the sta- 
tistical reliability of tree-constructing methods (Rohlf and Wooten 1988; 
Sourdis and Nei 1988; Hillis 1996). Results of these simulations are easy to 
summarize: if the data allow one method to assign a topology with good sta- 
tistical confidence, generally all the popular methods work pretty well. But 
if the data have many apparent reverse mutations, variable rates among
branches, or wide variation in rates across sites, then none of the methods
works very well.
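
To make the parsimony criterion concrete, the sketch below counts the smallest number of substitutions needed at a single site on a given rooted bifurcating tree. The counting rule is the one commonly known as Fitch's algorithm, which the text does not describe; it is brought in here only as an illustration, and the function name, the nested-tuple trees, and the example data are hypothetical.

```python
def fitch_score(tree, states):
    """Return (possible ancestral states, minimum number of changes) for one site.

    tree   : a leaf name (str) or a pair (left_subtree, right_subtree)
    states : dict mapping each leaf name to its base at this site
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_score(left, states)
    right_set, right_cost = fitch_score(right, states)
    shared = left_set & right_set
    if shared:
        # the two subtrees can agree on a state: no extra change is needed
        return shared, left_cost + right_cost
    # otherwise one substitution is required on this part of the tree
    return left_set | right_set, left_cost + right_cost + 1

site = {"A": "G", "B": "G", "C": "T", "D": "T"}
for topology in ((("A", "B"), ("C", "D")), (("A", "C"), ("B", "D"))):
    print(topology, fitch_score(topology, site)[1])
```

Summing such counts over all sites gives the length of a tree, and the most parsimonious tree is the topology with the smallest total. In the example the first topology needs one change and the second needs two.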

Bootstrapping and Statistical Confidence in a Tree 

Because there are so many possible tree topologies, it is important to assess 
how much statistical confidence one can place in a particular tree. One can- 
not assign a numerical standard error to a tree; by its geometrical nature a 
tree is actually a complicated statement of phylogenetic relationships, such 
that we might have high confidence in some branches, and low confidence 
in others. A widely used method of assessing confidence in the nodes of a 
tree is the bootstrap test (Felsenstein 1985). The basic idea is quite simple: a
new data set of the same size is drawn from the original data by sampling
sites with replacement and, from this new
data set, a tree is drawn. For each node in the original tree, we ask whether
the new tree has the same cluster of sequences. The whole operation of
resampling the data, drawing a tree, and tallying up nodes that are in the
original tree is repeated perhaps 1000 times, and the final result is displayed
graphically as a number next to each node indicating the percentage of time
that cluster is present among the resampled trees. If that fraction is high,
then one gains confidence that the given cluster actually belongs together. 
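
The resampling step can be sketched without committing to any particular tree-building method. In the sketch below the "tree" is reduced to a deliberately crude stand-in, namely the pair of sequences with the smallest proportion of differing sites, so that the tallying logic stays visible; a real analysis would rebuild a full tree for each replicate. All names and the example alignment are hypothetical.

```python
import random

def p_distance(a, b):
    """Proportion of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def closest_pair(alignment):
    """Stand-in for tree building: the two sequences with the smallest distance."""
    names = sorted(alignment)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return min(pairs, key=lambda p: p_distance(alignment[p[0]], alignment[p[1]]))

def bootstrap_support(alignment, replicates=1000, seed=0):
    """Fraction of resampled data sets in which the original cluster recurs."""
    rng = random.Random(seed)
    n_sites = len(next(iter(alignment.values())))
    original = closest_pair(alignment)
    hits = 0
    for _ in range(replicates):
        # draw n_sites columns with replacement
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]
        resampled = {name: "".join(seq[c] for c in cols)
                     for name, seq in alignment.items()}
        hits += closest_pair(resampled) == original
    return original, hits / replicates

alignment = {"A": "ACGTACGTAC", "B": "ACGTACGTTC", "C": "ACTTACCTTG"}
print(bootstrap_support(alignment, replicates=500))
```

The key design point is that whole sites (columns) are resampled, so each replicate keeps the same sequences and the same number of sites as the original data.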

Another means of testing the statistical confidence in a tree is to test the 
null hypothesis that each interior branch has length zero. From distance 
methods, we often obtain estimates of all branch lengths in the tree, along 
with their standard errors. If we fail to reject the null hypothesis of zero 
length for an interior branch, then we lose confidence in the nodes surround- 
ing that branch. 

Shared Polymorphism 

One might intuitively expect that all the alleles of a species should cluster 
together on a gene tree, implying that the common ancestor of all the alleles is 
an ancestral allele within the same species. A few gene trees have been found 
to have the unexpected property that alleles in two or more species appear to
be interdigitated on the tree. This pattern, known as shared polymorphism or
trans-species polymorphism, has been observed in major histocompatibility
alleles in primates (Lawlor et al. 1988), in self-incompatibility alleles of plants
(Ioerger et al. 1991), and in several genes in the melanogaster species subgroup
of Drosophila (Hey and Kliman 1993). Figure 8.27 shows the probable means by
which shared polymorphism arises, namely, that the ancestral species was
polymorphic, and two or more alleles have remained in the descendant species
ever since the time of the common ancestor. Recall that the expected fixation
time for a new neutral mutation, given that it goes to fixation, is 4N
generations. This
means that neutral alleles are quite unlikely to remain polymorphic for much 
longer periods. Consequently, observation of shared polymorphism implies 
that either strong selection is retaining the alleles in the population, or that the 
species have diverged relatively recently. In the first two examples above, there
is good evidence that selection has maintained the polymorphisms, while in
the third example, the Drosophila species are recently enough diverged that
some shared neutral polymorphisms are expected.

Figure 8.27 Trans-species or shared polymorphism may occur if the ancestor
was polymorphic for two or more alleles and if alleles persist to the present in
both species.

Interspecific Genetics 

Phylogenetic inference from molecular sequences is a descriptive goal in the
sense that the primary objective is to obtain an accurate representation of the
ancestral history of the species. Population genetics can also address the
genetic basis for species differences, particularly in the case of species in
which some hybrids are at least partially fertile. Although these studies do
not directly address the genetic causes for species origination, they are
relevant to the genetic causes of barriers to interspecific gene flow.
Investigation of the genetic basis for hybrid infertility and inviability among
species in the Drosophila melanogaster species subgroup (comprising the
species melanogaster, simulans, sechellia, and mauritiana) is a very active
area. One focus in this work has been an investigation of the genetic basis for
Haldane's rule, which states that, in interspecific hybrids in which only one
sex is sterile or inviable, the sex likely to be affected is the heterogametic
sex (Coyne 1985; Coyne et al. 1991). Rather than one or two genes of large
effect, interspecific hybrid sterility appears to be caused by many genes that
also have a complex pattern of interaction, so that some particular
combinations are sterile and other combinations are fertile (Palopoli and Wu
1994). A powerful tool for
studying the genetic basis of hybrid sterility has been to introgress small 
pieces of the genome from one species into the other. By doing this for many 
regions distributed all over the genome, one can learn about the relative roles 
of the X chromosome and autosomes, the relative incidence of male vs. female 
infertility, and so forth (True et al. 1996). Other features of interspecific differ- 
ences are amenable to genetic analysis by either introgression methods 
(applied to differences in cuticular hydrocarbons by Coyne 1996) or by scor- 
ing an array of anonymous markers in many backcross individuals (applied 
to genital arch morphology by Liu et al. 1996).


Genes increase in number through duplication. Several successive rounds of 
duplication result in a family of homologous genes with related functions, a 
multigene family, the members of which are often arrayed in tandem along
the chromosome. Among genes that normally exist in tandemly arrayed
multigene families are the rRNA genes and the histone genes. Analysis of the
sequences of members of multigene families has led to some interesting
surprises. Figure 8.28 shows a scenario whereby a gene underwent a
duplication that ultimately became fixed in the population either through drift or
selection. Subsequently, sufficient sequence divergence occurred that the two 
genes could be distinguished. Later a speciation event produced two different
species sharing this pair of genes. Figure 8.29 shows the gene genealogies
at two time points in the evolution of this gene family. At time 1, the A genes
in species 1 and 2 have a more recent common ancestor than do genes A and
B within species 1. At time 2 the pairs of genes present in the same species
are more similar. This is the pattern that is observed in some multigene
families. The close resemblance of $A_1$ with $B_1$, and of $A_2$ with $B_2$,
seems paradoxical, since both species have the duplication, and Figure 8.28
makes it appear that genes $A_1$ and $A_2$ in the two species have a more
recent common ancestor than do genes $A_1$ and $B_1$. Genes $A_1$ and $B_1$, as
well as $A_2$ and $B_2$, may have more similar sequences because the genes
evolve together, in concert, under the influence of mechanisms that operate to
homogenize their sequences. This tendency toward homogenization is known as
concerted evolution.

Figure 8.28 Multigene families originate by a process of gene duplication.
After the duplication the genes may retain very similar functions (like rRNA
genes), or they may diverge (like globin genes). If the species splits into two
species, then time 1 and time 2 depict the relationship between the genes
shortly after speciation and long after speciation (see Figure 8.29).

Causes of Concerted Evolution 

Two important mechanisms of concerted evolution are gene conversion and
unequal crossing-over. Gene conversion is a process in which nucleotide
pairing between two sufficiently similar genes is accompanied by the excision
of all or part of the nucleotide sequence of one gene and its replacement by a
replica of the nucleotide sequence from the other gene. Formally, the result
is that the sequence in one gene "converts" the sequence in the other gene to
be exactly like itself. In unequal crossing-over, meiotic pairing between the
tandem repeats in homologous chromosomes is out of register, and
crossing-over results in an increase in the number of copies in one chromosome
and a corresponding decrease in the number of copies in the other chromosome.
Repeated rounds of unequal crossing-over can result in the disproportionate
representation of certain sequences among members of the multigene family,
a result that is formally equivalent to gene conversion.

Figure 8.29 Referring to Figure 8.28, at Time 1, genes $A_1$ and $A_2$ in the
two species are more similar to each other than either is to gene B, and
likewise $B_1$ and $B_2$ are closest neighbors. This tree reflects the fact
that the common ancestor of $A_1$ and $A_2$ is more recent than that of $A_1$
and $B_1$. If at Time 2 a tree like the bottom panel is observed, then
sequences of $A_1$ and $B_1$ have become more similar, possibly by the process
of gene conversion. The bottom tree illustrates the phenomenon known as
concerted evolution.

A theoretical model of concerted evolution has been studied by Ohta
(1982). In this model, a tandemly arranged multigene family consists of a
fixed number n of members, and λ is the probability that a particular member
of the gene family becomes converted by another member in any one
generation. (Equivalently, λ is the probability of completion of a cycle of
unequal crossing-over resulting in the replacement of one sequence in the
family by another.) The mutation rate per copy is u, and the population size
is N.

In a tandemly arrayed multigene family, there are three distinct types of
identity by descent (IBD) among the gene copies (Figure 8.30):

1. Genes at different positions in the same chromosome may be IBD
(probability $c_1$).

2. Genes at different positions in different chromosomes may be IBD
(probability $c_2$).

3. Genes at the same position in different chromosomes may be IBD
(probability f).

Complex formulas for the equilibrium values of $c_1$, $c_2$, and f have been
derived by Ohta (1982), but they are greatly simplified when recombination
within the gene cluster is ignored. In such a case, the equilibrium values are

$$
c_1 = c_2 = \frac{\lambda}{\lambda + (n-1)u}, \qquad
f = \frac{4N\lambda c_2 + 1}{4N\lambda + 4Nu + 1}
\tag{8.18}
$$

In Equations 8.18, the quantity (n - 1)u is very nearly equal to nu if n is
reasonably large. Because n is the number of copies of the gene in each
tandem array, nu is the total rate of mutation in the multigene family, summed
across all copies. Thus, the implication of Equations 8.18 is that there is a
delicate balance between the rate of gene conversion λ and the total mutation
rate nu. If the rate of gene conversion is much greater than the total mutation
rate, then the probability of IBD of genes at different positions within the
family ($c_1$ and $c_2$) is close to 1.0. On the other hand, if λ is much
smaller than the total mutation rate, then the probability of IBD of genes at
different positions within the family is close to zero.

Figure 8.30 Three types of identity by descent in multigene families. They are
the identity between genes at homologous sites (probability f), between genes at
nonhomologous sites in the same chromosome (probability $c_1$), and between
genes at nonhomologous sites in different chromosomes (probability $c_2$).
(After Ohta 1982.)
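
The balance between conversion and mutation can be explored numerically. The sketch below assumes the no-recombination equilibrium forms c1 = c2 = λ/[λ + (n - 1)u] and f = (4Nλc2 + 1)/(4Nλ + 4Nu + 1), with the conversion rate written `lam` in the code; the function name and all parameter values are hypothetical and chosen only to show the two limiting cases.

```python
def ohta_equilibrium(lam, u, n, N):
    """Equilibrium identity probabilities, ignoring recombination in the cluster.

    lam : per-generation probability that a copy is converted by another copy
    u   : mutation rate per copy
    n   : number of copies in the tandem array
    N   : population size
    Returns (c, f), where c = c1 = c2.
    """
    c = lam / (lam + (n - 1) * u)
    f = (4 * N * lam * c + 1) / (4 * N * lam + 4 * N * u + 1)
    return c, f

# Conversion much faster than total mutation: copies are nearly all identical
print(ohta_equilibrium(lam=1e-3, u=1e-8, n=100, N=10_000))
# Conversion much slower than total mutation: copies are nearly independent
print(ohta_equilibrium(lam=1e-10, u=1e-6, n=100, N=10_000))
```

Setting `lam` to zero recovers f = 1/(4Nu + 1), the familiar single-locus result, which is a useful check on the formula.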

Concerted evolution does not homogenize all multigene families. 
Depending on the balance of the forces of mutation, gene conversion, and 
unequal crossing-over, the pair of genes may remain active and very similar,
or they may diverge in function (such as different tissue-specific forms of
amylase or lactate dehydrogenase), or one gene may lose function and
become a pseudogene. Multigene families can avoid the accumulation of
mutations when there is sufficiently strong natural selection, and positive 
selection is necessary for genes to evolve new functions. Walsh (1988) 
addressed the question of genes within a family escaping from gene conver- 
sion, and he showed that higher mutation rates and lower conversion rates 
lead to greater likelihood for a gene escaping conversion. Once a gene is suf- 
ficiently divergent to have escaped conversion, it can either lose function 
and become a pseudogene or it can acquire a new function. Simple models of 
such a duplicated gene show that very little selection is needed in a large 
population to avoid a pseudogene fate (Walsh 1995). 

Multigene Family Evolution through a Birth and Death Process 

Duplicate genes can evolve in separate ways under the influence of natural 
selection, mutation, and random genetic drift. In time, some members of a 
multigene family may diverge to a greater or lesser degree in their function.
This process of duplication and divergence is thought to be the major
mechanism by which genes with novel functions are created. Some multigene
families retain a tandemly arrayed structure and similarity in function across
members despite the fact that the differences between individual members
are of functional significance. This pattern is particularly true of genes in
the immune system, including immunoglobulin genes and major
histocompatibility genes. Interspecific comparisons of genes in families of
this sort exhibit some genes that are clearly homologous, and others that are
more distantly related. In addition, the rates of duplication, of loss of
function through pseudogene formation, and of loss by deletion may be fairly
high. This kind of pattern of multigene family evolution is different from
concerted evolution, because the differences between the genes can be high
enough that intergenic conversion is very rare. Figure 8.31 illustrates the
distinctness of this pattern of gene evolution, called a birth-and-death
process by Ota and Nei (1994).

Figure 8.32 illustrates the result of duplication and divergence in two
related multigene families in mammals that code for the α-like and β-like
polypeptide chains of hemoglobin. The genes are specialized for different
periods of life. The ε (epsilon) genes are expressed in embryos; the Gγ and Aγ
genes and the α genes in the fetus; and the α, β, and δ genes in the adult. The
inference from differences in nucleotide sequence is that the original
α-β duplication took place approximately 500 million years ago, when
vertebrates were represented by the bony fishes, and the β-γ duplication took
place about 80 million years ago, during the mammalian radiation. More
recent duplications have also occurred, for example those leading to the two
functional α genes, the cluster of three α-like pseudogenes, and the two γ
genes. There are several models for the sequence of duplication, deletion,
and conversion events that could have led to the current array of globin
genes (Goodman et al. 1984; Hardies et al. 1984; Hardison 1984; Margot et al.
1988), but it appears well substantiated that the ancestral cluster that
predated the mammalian radiation was 5'-ε-γ-η-δ-β-3'. Within the mammalian
radiation, the different orders of mammals evolved along different routes. In
prosimian primates, such as lemurs, there was a fusion of η and δ. In higher
primates, including humans, there was a δ-β conversion and a γ duplication.
In rodents, β and γ both duplicated, η was deleted, and there was a δ-β
fusion, mediated probably by an unequal crossover. In rabbits, η was deleted
and there was a δ-β conversion. Finally, in goats, γ was deleted, there
was a δ-β conversion, and the remaining four-gene array was then triplicated.

Figure 8.31 In addition to concerted evolution and simple divergent
evolution, multigene families frequently exhibit the phenomenon of genes being
added and lost to families by a "birth and death process." Panels: (A)
concerted evolution, (B) divergent evolution, (C) evolution by birth and death
process, each comparing Species 1 and Species 2. (From Ota and Nei 1994.)

The evolutionary history of the fetal globin genes in humans reveals that
the Gγ and Aγ genes originated as part of a relatively recent 5 kb tandem
duplication (Shen et al. 1981). Furthermore, evidence from nucleotide
sequences strongly suggests that a gene conversion event also occurred,
which converted part of one particular Aγ allele into a Gγ allele (Slightom et
al. 1980). The converted Aγ allele is very similar to a Gγ allele for about 1550
bp on the upstream (5') side of a putative recognition signal for gene
conversion (a stretch of repeating TG and CG dinucleotides); but on the
downstream (3') side of the putative signal, the converted Aγ allele is typical
of other Aγ alleles in the human population. The Aγ to Gγ gene conversion
occurred much more recently than the duplication resulting in the close
sequence similarity of the Aγ and Gγ genes.

Figure 8.32 Reconstruction of the β-globin sequences in a series of mammals
illustrates the complexity of duplication, loss, and gene conversion in this
multigene family. (After Hardison 1984.)

The estimate of the time of occurrence of the Aγ-Gγ duplication can be
improved by using the nucleotide sequence data from the entire duplicated
5 kb region. In the entire region, 14% of the nucleotide sites differ, which
translates into k = 0.155 ± 0.006; this suggests a time for the duplication of
0.155 × 100 × 2.2 × 10^6 = 34 million years (Shen et al. 1981).
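
The arithmetic behind this date can be retraced in a few lines. The sketch assumes that the conversion from 14% observed differences to k = 0.155 is the Jukes-Cantor correction for multiple hits, which reproduces the quoted value; the scaling factor of 100 × 2.2 × 10^6 is taken directly from the text and amounts to 2.2 million years per 1% divergence. The function name is ours.

```python
from math import log

def jukes_cantor(p):
    """Substitutions per site implied by a proportion p of differing sites
    (assumed Jukes-Cantor correction; requires p < 0.75)."""
    return -0.75 * log(1 - 4 * p / 3)

k = jukes_cantor(0.14)            # corrected divergence per site
years = k * 100 * 2.2e6           # percent divergence times years per percent
print(round(k, 3), round(years / 1e6, 1), "million years")
```

The correction matters little at this level of divergence (0.14 versus about 0.155), but it grows rapidly as the proportion of differing sites approaches saturation.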

Unequal crossing-over in multigene families can result in a decrease in
the number of genes as well as an increase. It is therefore not surprising that
deletions of one or more of the hemoglobin genes are found in most parts of
the world. Although such deletions are usually very rare, in a few places
their frequency reaches levels too great to be accounted for by chance,
especially in view of the observation that the carriers are mildly to severely
anemic. Although a deletion of the β-gene results in death when homozygous,
a β deletion and other mutations that decrease the abundance of the
β-hemoglobin chain are relatively common in the Mediterranean Sea basin
where malaria is endemic. For this reason, the decreased-β-chain diseases
are called β-thalassemias (literally translated as "sea-anemias"). The
well-established link between sickle-cell anemia and malaria, along with the
geographical correlation between the β-thalassemias and malaria, provides a
strong circumstantial case for malarial parasites being an important selective
agent. Deletion of one or more of the α-globin genes results in another form
of anemia called α-thalassemia, whose frequency in populations is also
correlated with the incidence of malaria.

Red-green colorblindness is a common X-linked disorder with a frequen- 
cy of about 8% in Caucasian males. The genes for the red and green visual 
pigments match at 98% of their nucleotides, indicating that they arose by a 
relatively recent duplication. Individuals with normal color vision have one 
copy of the red pigment gene and varying numbers of copies of the green 
pigment gene. When genomic DNA from colorblind males was analyzed by 
Southern blotting, those defective in green vision were lacking fragments of 
the green pigment gene. Further analysis showed that 24 of 25 colorblind 
individuals had lost one or the other pigment gene through gene rearrange- 
ments that were due either to unequal crossing-over or gene conversion. In 
this example, the high sequence similarity of the red and green pigments 
works to human disadvantage by greatly increasing the likelihood of 
exchange events that lead to loss of color vision (Nathans et al. 1986). The
relationship between the molecular basis of light absorption and perception 
was made particularly clear when it was found that a normal polymorphism 
in red pigments, which confers a difference in the absorption peak of the pro- 
tein product, also confers a measurable difference in the perception of color 
balance (Merbs and Nathans 1992). 

Duplication of genes also occurs in plants, including a particularly
important gene in plants that encodes the carbon-fixing enzyme
ribulose-1,5-bisphosphate carboxylase (RBC) (Clegg et al. 1997). The
functional RBC holoenzyme consists of eight large and eight small subunits.
Early in plant evolution, both the large and small subunits of RBC were
encoded by the chloroplast genome, but the small subunit gene was transferred
to the nuclear genome at an early stage and has now been lost from the
chloroplast genome. Diploid angiosperms contain from two to eight copies of
the gene for the small RBC subunit (rbcS). All copies of rbcS appear to be
functionally equivalent, and sequence analysis shows that the genes that are
closest together in the genome are also generally more similar in sequence. In
sequence comparisons among rbcS genes of tobacco and tomato, homologous
genes compared between the two species are more similar than within-species
comparisons of gene copies. This finding is not the pattern expected
under concerted evolution. The variable number of loci across angiosperms
suggests that gain and loss of gene copies occurs to give a pattern like the
birth-and-death process described above.

Structural RNA Genes and Compensatory Substitutions 

Transfer RNA and ribosomal RNA molecules derive their biochemical prop- 
erties from the secondary structure into which they fold. We are still learning 
the chemical rules by which such macromolecules attain their final folded 
configuration, but one thing that is very clear is that complementary base 
pairing is important. The stems of tRNAs are critical to maintaining the
tightly folded structure of these essential molecules. Substitutions that occur in
stems will weaken the stability of the stem unless there is a compensatory 
change on the other strand that maintains base pairing. Kimura (1985) real- 
ized that one could obtain evidence for such compensatory changes. More 
recently, such compensatory changes have been demonstrated in an intron, 
demonstrating that the folding structure of introns may also be important to 
regulating gene expression (Kirby et al. 1995). 
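
The logic of looking for compensatory changes can be sketched with a toy comparison of two aligned RNA sequences whose stem positions are known. Everything in the sketch is an illustrative assumption: the function name, the restriction to Watson-Crick pairs (G-U wobble pairs are ignored), and the example sequences and stem coordinates.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def classify_stem_pairs(seq_a, seq_b, stem_pairs):
    """Compare paired stem positions between two aligned RNA sequences.

    stem_pairs : list of (i, j) index pairs that base-pair in the stem
    Returns counts of pairs that are unchanged, compensatory (both bases
    differ but pairing is kept), unpaired in seq_b, or otherwise changed.
    """
    counts = {"unchanged": 0, "compensatory": 0, "unpaired": 0, "other": 0}
    for i, j in stem_pairs:
        before = (seq_a[i], seq_a[j])
        after = (seq_b[i], seq_b[j])
        if before == after:
            counts["unchanged"] += 1
        elif after not in WATSON_CRICK:
            counts["unpaired"] += 1
        elif before[0] != after[0] and before[1] != after[1]:
            counts["compensatory"] += 1
        else:
            counts["other"] += 1
    return counts

stem = [(0, 9), (1, 8), (2, 7)]          # hypothetical three-pair stem
print(classify_stem_pairs("GGCAAAAGCC", "GAAAAAAGUC", stem))
```

An excess of compensatory pairs over what independent substitution at the two positions would produce is the kind of evidence that points to selection maintaining the stem.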

Further evidence of the importance of secondary structure of rRNA 
comes from analysis of rRNA pseudogenes in plants (Buckler et al. 1997). 
One attribute of secondary structure is measured as the difference in free 
energy attributable to complementary base pairing in the folded vs. unfold- 
ed state. Computer predictions of the best folding structure of the rRNA 
pseudogenes suggested that the difference in free energy decreases as the 
sequences accumulate substitutions. Tests of randomly permuted sequences 
showed that the functional rRNA sequences are significantly more stable 
than would be obtained by chance, whereas predicted pseudogene RNAs are
not. Some introns have a significantly open secondary structure, such that
random substitutions in their sequences result in more stable structures
(Leicht et al. 1995). The reason some introns retain an open structure may be
for access to regulatory proteins. This possibility has been indirectly demon- 
strated by showing that stable stems inserted into introns in yeast can disrupt 
normal splicing. 

The ribosomal RNA gene cluster in Drosophila melanogaster consists of
about 200 copies of a repeated unit on both the X and the Y chromosome,
with each repeated unit containing an 18S and a 28S rRNA gene separated by
an intergenic sequence (IGS) (Glover and Hogness 1977). The rRNA genes
provide a clear example of concerted evolution because of great interspecific
differences in spite of a high degree of sequence conservation within species
(Coen et al. 1982). Furthermore, within individuals of D. mercatorum, there
appears to be little sequence variation, yet there are clear differences between
individuals due to length variation in the intergenic sequence (Williams et al.
1985). This finding suggests the operation of a strong homogenizing force
maintaining sequence fidelity within individuals. In humans, the rDNA 
repeat consists of a 13 kb transcribed portion and a 31 kb spacer (Wellauer 
and Dawid 1979). This repeated unit is present in about 300 copies located 
near the tips of the short arms of five nonhomologous chromosomes. Despite 
the dispersed locations, concerted evolution still occurs as evidenced by 
much less variation among sequences within an individual than among 
species. Interchromosomal exchange events would lead to conservation of 
sequence distal to the rDNA cluster on each chromosome, and evidence for 
this conservation has been found (Worton et al. 1988).

Multigene Superfamilies

In some cases, several sets of multigene families and single-copy genes may
share recognizable homology, implying a common ancestry, but they have
undergone major divergence in function and relocation of position within
the genome. These sets of historically related but functionally distinct genes
constitute a multigene superfamily.

The remarkable similarities found among portions of genes in related 
gene families have suggested that many proteins have functional modules that
can be combined in various ways in what is called exon shuffling. One 
example of shuffling is found in tissue plasminogen activator (TPA), which 
has portions of three other proteins, including plasminogen, epidermal 
growth factor, and fibronectin. The striking finding is that the junctions of 
these protein segments fall precisely at intron-exon junctions. The epidermal 
growth factor shares exon similarity with several other proteins, including 
blood clotting factors IX and X, urokinase, and complement C9 (Doolittle
1985). The gene for the low-density lipoprotein (LDL) receptor in human 
beings extends over 45 kilobases and contains 18 exons that show similarity 
to a bewildering variety of other proteins, including epidermal growth factor 
and blood clotting factors (Sudhof et al. 1985). Just as a computer program- 
mer recognizes the value of reusing subroutine modules in different pro- 
grams, nature has capitalized on the efficiency of modular gene organization. 

One extensively studied multigene superfamily that serves diverse functions
in immunity is illustrated in Figure 8.33 (Hood 1985; Hunkapiller and
Hood 1986). The primordial single-copy gene may have coded for a cell-sur- 
face receptor containing the basic homology unit of the superfamily, which is 
about 110 amino acids in length with a strategically placed disulfide bridge 
and folding characteristics enabling it to combine with other similar units. 
An early duplication and divergence of the primordial gene resulted in the
variable (V) and constant (C) domains that have been so versatile in their
diversification for specialized immune functions. In some members of the
immunoglobulin superfamily, shown at the left in Figure 8.33, the functional
products are usually individual polypeptide chains, sometimes containing
internal duplications of the primordial folding unit. These products include
the poly-Ig receptor that mediates the transport of immunoglobulin molecules
across cell membranes.

Figure 8.33 Proposed evolution of the immunoglobulin multigene superfamily
from a primordial gene coding for a cell-surface receptor. Details of the
evolutionary relationships are speculative. The superfamily has diversified into 12
single-gene representatives (all of those at the left, plus β2-microglobulin,
β2-m, at the right), and eight multigene families (remaining representatives at
the right). These include genes for antibodies, T-cell receptors, major
histocompatibility antigens, and other functions. The single-gene members include T-cell
molecules implicated in MHC recognition (CD4 and CD8) and possibly ion
channel formation (T3δ, T3ε), an immunoglobulin-transport protein (poly-Ig), a
plasma protein (α1B-glycoprotein), two molecules restricted to lymphocytes and
neurons (Thy-1 and OX-2), two brain-specific proteins (N-CAM and NCP3), and
β2-microglobulin. The multigene families include the heavy (H) and light (κ, λ)
components of antibody molecules, the α, β, and γ chains of T-cell receptors, and
the Class I and Class II molecules from the major histocompatibility complex
(HLA). (Adapted from Hood et al. 1985 and Hunkapiller and Hood 1986.)

In the other main branch of the superfamily, shown at the right, the functional
products are usually aggregates of polypeptide chains. In this branch,
there occurred multiple duplications of the V regions and specialization of D 
(diversity) and J (joining) regions during the evolution of the DNA splicing 
mechanism in lymphocytes, which today results in the tremendous diversity 
of antibodies and T-cell receptors. During the formation of heavy-chain anti- 
body genes in the lymphocytes, any one of a large number of DNA 
sequences coding for the variable part of the molecule can become spliced 
with any one of a small number of DNA sequences coding for the constant 
part, with diversity and joining regions incorporated in between. The many 
possible V-D-J-C combinations enable enormous numbers of different possible
antibodies to be formed, and this number is increased still further by slight
variation in the exact positions of the splice junctions. An analogous type of
splicing process occurs in the formation of antibody light-chain genes and T- 
cell receptor genes. 

In yet another offshoot of the immunoglobulin superfamily, shown at far
right in Figure 8.33, the C region underwent duplication and specialization to 
form molecules of the major histocompatibility complex (MHC), which, 
among other functions, are necessary for the T cells of the immune system to 
recognize foreign antigens. Complete sequencing of a 100 kb region of the T- 
cell receptor gene family has revealed a spectacular degree of sequence conser- 
vation between human and mouse (Koop and Hood 1994). The opportunities 
for exceptionally detailed analysis of multigene family evolution have enlarged 
with genomic sequencing methods already producing the complete sequence 
of entire arrays of genes (Rowen et al. 1996). Although many aspects of the
immunoglobulin superfamily tree in Figure 8.33 are speculative, the molecules
are undoubtedly related because comparison of the relevant units gives 15 to 
40% homology at the amino acid level, and at the DNA level each homology 
unit is encoded in a separate exon. The immunoglobulins thus demonstrate the 
immense evolutionary potential of repeated rounds of duplication and diver- 
gence through specialization of function. 

Dispersed Highly Repetitive DNA Sequences 

A second major class of highly repetitive DNA in eukaryotes is not localized 
in clusters of tandemly repeating units, but is dispersed throughout the
genome with single-copy sequences. The importance of dispersed repetitive
elements to the human genome project is made clear by the realization that 
they constitute 35% of our genome (Smit 1996). In vertebrates, this dispersed 
highly repetitive DNA occurs primarily in two categories, denoted SINEs
and LINEs (Singer 1982). SINEs (short interspersed elements) are sequences
typically shorter than 500 base pairs that occur in 10^5 or more copies in the
genome. Like tRNA genes, they contain internal transcriptional start sites
and are transcribed by RNA polymerase III. LINEs (long interspersed elements)
are sequences typically greater than 5000 base pairs that occur in 10^4
or more copies in the genome. They are processed pseudogenes (see below)
and, when transcribed, are transcribed by RNA polymerase II. Marked differences
in the particular array of subfamilies of SINEs or LINEs or both
are frequently observed among even closely related species (Figure 8.34). The
mechanisms and possible significance of such massive and rapid changes in 
repetitive DNA in the genome are very obscure. 

One example of SINEs in human DNA is the Alu family, named because
the sequence contains a characteristic restriction site for the restriction
enzyme AluI. The Alu sequence is about 300 nucleotides in length. Alu
sequences are present in approximately one million copies in the human
genome and constitute approximately ten percent of the total DNA (Smit
1996). Sequences closely related to Alu are found in other primates, and more
distantly related sequences occur in rodents and probably in all placental
mammals. Two randomly chosen human Alu sequences differ, on the average,
at 15 to 20% of their nucleotide sites, which calculates to a time of divergence
of between 16.7 and 23.3 million years. In the human genome there is
an Alu element an average of every 3 to 5 kb, but the distribution is not uniform.
For example, the β-tubulin and thymidine kinase gene regions have
about 10 times the average density of Alu repeats (Slagel et al. 1987), and Alu
repeats show a preference for integrating into oligo-dA runs (Daniels and
Deininger 1985).
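
The divergence times quoted for Alu sequences can be reproduced with a short calculation. The sketch below is one reading of the arithmetic, not a statement of the authors' exact procedure: it applies the Jukes-Cantor correction to the observed proportion of differing sites and assumes a substitution rate of 5 × 10^-9 per site per year along each lineage. That rate is an assumption chosen because it recovers the quoted figures; it is not given in the passage.

```python
import math

def jukes_cantor(p):
    """Expected substitutions per site for an observed proportion p of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def divergence_time(p, rate_per_site_per_year):
    """Years since two copies diverged; both lineages accumulate changes, hence the 2."""
    return jukes_cantor(p) / (2.0 * rate_per_site_per_year)

ASSUMED_RATE = 5e-9  # substitutions per site per year (assumption, see lead-in)

for p in (0.15, 0.20):
    t = divergence_time(p, ASSUMED_RATE)
    print(f"p = {p:.2f}: about {t / 1e6:.1f} million years")
```

With these inputs the two estimates come out near 16.7 and 23.3 million years, matching the range given above.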

Figure 8.34 A dot plot comparison of the human and rabbit sequences spanning
δ- and β-globins. Each dot represents a small bit of sequence similarity,
much of the background due solely to chance, and the regions of extended
similarity stand out as diagonal line segments. The scales are in kilobases, and the
rectangles indicate the location and organization of the globin genes. The solid
arrows show the location of a rabbit L1 repeat, and open triangles indicate
human Alu sequences and rabbit OcC repeats (a rabbit SINE). The major diagonal
line indicates that there is noticeable homology retained through the δ-β
intergenic region, and the sequence similarity of human β-globin with rabbit
δ-globin (and vice versa) is evident. (From Margot et al. 1988.)

PROBLEM 8.11 The third chromosome of Drosophila pseudoobscura
is polymorphic for more than a dozen inversions that result in different
gene orders. Polymorphisms of this sort are different from nucleotide
site substitutions because they retain some information about
the order of events. Consider, for example, the sequences A-B-C-D-E
and C-E-A-D-B. Can you deduce the order of the events that connect
them?

ANSWER From A-B-C-D-E, the first inversion must have been the
segment A-B-C, giving the sequence C-B-A-D-E. Next, the segment
A-D inverted to give C-B-D-A-E. Finally, the segment B-D-A-E
inverted to give C-E-A-D-B. Much more elaborate problems of inference
have arisen to determine the ancestral series of inversions and
number of events needed to go from one gene order to another. Computer
scientists refer to this problem as "sorting by reversals." You can
see that given any random ordering of integers, a finite number of
inversions or reversals will put them into the correct order. Motivated
by the biological problem, an algorithm for finding the minimum
number of reversals to go from one order to another was recently
implemented (Bafna and Pevzner 1996). As more genomes are fully
mapped and sequenced, this is likely to be an area of considerable
excitement. Ehrlich et al. (1997) recently estimated the number of
rearrangements required to connect the human and mouse
genetic maps as about 180.
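
For gene orders as short as the one in this problem, the minimum number of inversions can be found by exhaustive search. The sketch below is a brute-force breadth-first search over all gene orders, offered only as an illustration; it is not the algorithm of Bafna and Pevzner, it ignores gene orientation, and it becomes impractical beyond roughly ten genes. The function name `reversal_distance` is an assumption.

```python
from collections import deque

def reversal_distance(start, goal):
    """Minimum number of segment inversions turning one gene order into another."""
    start, goal = tuple(start), tuple(goal)
    steps = {start: 0}          # gene order -> fewest inversions needed to reach it
    queue = deque([start])
    while queue:
        order = queue.popleft()
        if order == goal:
            return steps[order]
        n = len(order)
        for i in range(n):
            for j in range(i + 2, n + 1):        # invert segments of length >= 2
                nxt = order[:i] + order[i:j][::-1] + order[j:]
                if nxt not in steps:
                    steps[nxt] = steps[order] + 1
                    queue.append(nxt)
    return None                 # goal is not a rearrangement of start

print(reversal_distance("ABCDE", "CEADB"))
```

Because breadth-first search visits gene orders in order of increasing number of inversions, the first time the goal is reached the count is minimal.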

An example of LINEs in the human genome is the L1 family of sequences
(also called LINE-1 or Kpn, because of a characteristic restriction site). The L1
sequences average about 2,000 nucleotides, and the 50,000 copies of the
sequence in the human genome account for about 4% of the total DNA. As
with the Alu family, sequences related to L1 are found in other mammals,
including the mouse (Hardies et al. 1986) and the rabbit (Demers et al. 1986).
Not all insertions of L1 sequences are innocuous. Kazazian et al. (1988) found
two cases of hemophilia A that were caused by de novo insertions of an L1
sequence into exon 14 of the factor VIII gene, whose function is necessary
for normal blood coagulation. This insertional mutation event was evidently
mediated by an RNA intermediate and provides a mechanism for natural
selection to operate on L1 elements. Another deleterious mutation caused by
a transposable element in humans was an insertion of an L1 sequence into
the myc oncogene in a human breast cancer (Morse et al. 1988).

In their molecular organization, LINE sequences strongly resemble a class 
of pseudogenes known as processed pseudogenes. Processed pseudogenes are 
thought to result from the reverse transcription of an RNA molecule into 
DNA, followed by insertion of the DNA into the genome. The reverse tran- 
scription and integration process can be carried out by an enzyme called 
reverse transcriptase, which is coded in the genome of a class of RNA-containing
viruses called retroviruses. In cells infected with retrovirus, the
reverse transcriptase makes a DNA copy of the viral RNA, and another
enzyme inserts the DNA into the chromosome. When reverse transcription 
and integration happen to a processed RNA molecule, the result is a dispersed 
duplicate copy that is generally transcriptionally inactive due to loss of regu- 
latory sequences. Such a sequence is known as a processed pseudogene. 
Many genes are known to have processed pseudogene counterparts, including
the genes for human κ-immunoglobulin and β-tubulin, rat α-tubulin and
cytochrome c, and mouse α-globin. Not all genes that have been processed
through an RNA intermediate are pseudogenes. Human phosphoglycerate
kinase (PGK) occurs as an active X-linked gene, a processed X-linked pseudogene,
and an autosomal gene with remarkable properties. The normal PGK-1
gene contains 11 exons and 10 introns, but the autosomal gene has no introns
and has remnants of a poly-A tail, strongly implying that it was reverse transcribed
from an RNA transcript. The intron-free autosomal gene (PGK-2) is
expressed in human testes (McCarrey and Thomas 1987). 

The processed pseudogene model of dispersed repeated DNA evolution 
is illustrated in Figure 8.35 (Hardies et al. 1986). The functional, transcribed 
copies of the gene family are shown at the top, and the horizontal arrows rep- 
resent gene conversion, which promotes concerted evolution of the function- 
al genes. The gene in the center is a preferred donor for gene conversion 
(biased gene conversion). Emanating from the functional genes are numerous
copies of processed pseudogenes distributed throughout the genome. These
copies are essentially functionless and undergo sequence divergence promoted
by mutation and random genetic drift, which is offset in part by gene
conversion and other homogenizing processes among the pseudogenes.
Eventually the pseudogene sequences are cleared from the genome by deletion
or extreme sequence rearrangement or divergence.

Figure 8.35 Model for the evolution of a dispersed highly repetitive family of
processed pseudogenes. A small number of functional genes (top), which
undergo concerted evolution by means of gene conversion, are transcribed
under conditions that favor reverse transcription and integration into numerous
dispersed chromosomal locations. The resulting nonfunctional genes undergo
mutation and random genetic drift, and are ultimately eliminated by deletion or
other mechanisms. (From Hardies et al. 1986.)

One implication of the model in Figure 8.35 is that, eventually, a balance 
is reached in which the clearance of old pseudogenes from the genome is 
equaled by the creation and insertion of new ones. In the equilibrium state 
there is a steady turnover among sequences in the family, but the total num- 
ber neither grows nor shrinks. Studies of a dispersed repeated sequence in 
the mouse related to human L1 suggest a turnover with a half-life of approximately
two million years. That is, after two million years, half the members
of the gene family will have been removed and replaced with new ones.
However, the L1 family may evolve more rapidly than is typical.
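
The turnover figure can be turned into rates with a few lines of arithmetic. The sketch below assumes that each element is lost at a constant rate h per year, so that the fraction surviving after t years is exp(-ht); this exponential model and the names used are assumptions made for illustration, and the copy number passed to `insertions_needed` is a placeholder rather than a value from the text.

```python
import math

HALF_LIFE = 2.0e6                  # years, the approximate half-life quoted above

h = math.log(2) / HALF_LIFE        # loss rate per element per year
mean_persistence = 1.0 / h         # average lifetime of an element, in years

def fraction_remaining(t):
    """Fraction of the elements present at time 0 that survive to time t (years)."""
    return math.exp(-h * t)

def insertions_needed(copies):
    """New insertions per year that exactly balance deletions at equilibrium."""
    return h * copies

print(f"loss rate h = {h:.3e} per year")
print(f"mean persistence = {mean_persistence / 1e6:.2f} million years")
print(f"fraction left after 2 million years = {fraction_remaining(2.0e6):.2f}")
print(f"insertions per year for 10,000 copies (placeholder) = {insertions_needed(10_000):.4f}")
```

Under this model the mean persistence of an element is longer than the half-life by a factor of 1/ln 2, or about 1.44.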

The very abundance of pseudogenes implies that many unrelated genes 
may have pseudogenes in the same vicinity, as is the case with Alu sequences 
interspersed in the β-globin cluster. Some fraction of these linked pseudogenes
may alter the level, timing, or tissue distribution of transcription of the
genes to which they are linked, or they may have subtle effects on chromatin 
structure that affect gene expression. Through any of a diversity of mecha- 
nisms, pseudogene copies of dispersed highly repeated gene families could, 
in principle, have effects on phenotype and thus be subject to the influence of 
natural selection. While true in principle, such effects have not yet been 
demonstrated. To the extent that such effects can safely be ignored, the evolutionary
mechanism of highly dispersed repeated DNA sequences is that of
selfish DNA, subject to the conflicting forces of neutral mutation/random 
drift and the diverse homogenizing processes of concerted evolution. 


Summary

The discipline of molecular population genetics has as its theoretical foundation
the neutral theory, which provides a rich set of testable hypotheses
about the mechanisms that modify patterns of sequence divergence and 
sequence polymorphism. We saw that underlying models must be specified 
even to do seemingly straightforward things like estimating rates of substi- 
tution. The reason substitution rate estimates are not trivial is that, with 
greater divergence, subsequent mutations may not further increase the 
divergence if the site has already been substituted. From observed counts of 
amino acid or nucleotide differences, we usually want to estimate numbers 
of changes per site. The model for amino acid substitution is not very difficult
because there are 20 amino acids, but even the simplest nucleotide substitution
model of Jukes and Cantor is subtle. More complicated models
account for differences in rates of transition and transversion substitutions,
and it immediately becomes apparent that both the process of mutation and
of substitution can be of any imagined degree of complexity.
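
The point about repeated changes at the same site can be illustrated numerically. The sketch below implements the Jukes-Cantor correction in a form generalized to any number of equally likely states, so that the same function covers nucleotides (4 states) and a simple equal-rates treatment of amino acids (20 states). The function name and the generalization to `n_states` are assumptions made for illustration; equal substitution rates among all states are assumed throughout.

```python
import math

def corrected_distance(p, n_states=4):
    """Substitutions per site implied by an observed proportion p of differing sites.

    Assumes every state changes to every other state at the same rate.
    With n_states=4 this is the Jukes-Cantor formula -(3/4) ln(1 - 4p/3).
    """
    b = (n_states - 1) / n_states        # saturation level of p under the model
    if not 0 <= p < b:
        raise ValueError("p must lie in [0, (n-1)/n) for the correction to exist")
    return -b * math.log(1.0 - p / b)

for p in (0.05, 0.20, 0.50, 0.70):
    print(f"observed {p:.2f} -> nucleotides {corrected_distance(p):.3f}, "
          f"amino acids {corrected_distance(p, 20):.3f}")
```

The corrected value is always at least as large as the observed proportion, and the gap widens rapidly as the observed proportion approaches its saturation level (3/4 for nucleotides).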

Out of sequence analyses there emerges the pleasing generalization that 
many sequences appear to diverge at an approximately clock-like rate. This 
molecular-clock concept should be interpreted somewhat loosely, because 
rigorous statistical tests have identified significant irregularities in its rate. In 
addition, there are dramatic differences in rate of evolution across genes, 
because the neutral substitution rate differs from one gene to the next. Some
lineages appear to have accelerated or decelerated clock rates, and one cause 
for the variation is a change in generation time (for example, from rodents to
primates).

Synonymous and nonsynonymous substitutions have different effects on 
the protein product, so estimating the rates of these two kinds of substitution
independently can be informative about the causes of evolutionary
change. For example, most genes, like Drosophila Adh, have a large excess of
synonymous changes, an observation that is accounted for by the presumed
deleterious effect of most amino acid replacements. Not all synonymous codons
are used with equal frequency, and the bias in codon usage implies that
even synonymous substitutions may not be selectively neutral. The most sen- 
sitive tests for selection make use of comparisons between intraspecific poly- 
morphism and interspecific divergence. Under strict neutrality, these two 
quantities should be related to one another, and departures in either direction 
can be detected through heterogeneity among genes. 
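
One way to picture such a comparison is a 2 × 2 contingency test on counts of nonsynonymous and synonymous differences that are either fixed between species or polymorphic within a species, in the spirit of the McDonald-Kreitman test. The sketch below is a minimal illustration under stated assumptions: the counts are chosen for the example and are not taken from this text, and the Fisher exact test is implemented directly with `math.comb` rather than through a statistics library.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with margins fixed.
        return comb(col1, x) * comb(total - col1, row1 - x) / comb(total, row1)

    lowest = max(0, row1 - (total - col1))
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum over every table that is no more probable than the observed one.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

def neutrality_index(fixed_non, fixed_syn, poly_non, poly_syn):
    """Ratio (Pn/Ps) / (Dn/Ds); values near 1 are expected under strict neutrality."""
    return (poly_non / poly_syn) / (fixed_non / fixed_syn)

# Illustrative counts (assumed for this example).
fixed_non, fixed_syn = 7, 17      # differences fixed between species
poly_non, poly_syn = 2, 42        # differences polymorphic within species

p = fisher_exact_two_sided(fixed_non, fixed_syn, poly_non, poly_syn)
ni = neutrality_index(fixed_non, fixed_syn, poly_non, poly_syn)
print(f"Fisher exact p = {p:.4f}, neutrality index = {ni:.3f}")
```

A neutrality index well below 1 together with a small p-value indicates more fixed amino acid replacements than the polymorphism data would predict under neutrality; an index well above 1 indicates the opposite departure.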

The neutral theory also makes predictions about the shape of gene trees, 
and there has been a great deal of excitement about the possibility of testing 
hypotheses about evolutionary forces based on inferred gene genealogies. 
(Problem 8.8 gives one example.) Gene trees have been used to test hypotheses
about selection, recombination, homogeneity of mutation, and even
migration. The ability to account for the patterns of correlation built up by
the ancestral history of genes has been a major advance in statistical popula- 
tion genetics. 

Organelle genome evolution occupies an important position in the devel- 
opment of molecular population genetics, in part because of the numerous 
studies of mtDNA and cpDNA variation. Of particular interest and controversy
was the work on human mtDNA variation, which raised many intriguing
problems about human origins. This work stimulated a huge amount of
theoretical study concerning the statistical inferences that could be made 
from sample data, including times of common ancestry, inference of past 
demographic histories, and so forth. Several recent studies have shown that 
mtDNA exhibits patterns consistent with the past operation of natural selection,
in violation of many of these models.

Molecular phylogenetics seeks to reconstruct the ancestral history of
extant organisms, and shares many analytical procedures with molecular 
population genetics. There are several widely used algorithms for recon- 
structing a tree from sequence data, and we examined in some detail the 
UPGMA, least-squares, neighbor-joining, and parsimony methods.
One of the more intriguing patterns of variation to emerge from such inter- 
specific comparisons is that of shared polymorphism, in which two or more 
species share a number of alleles in common. It is unlikely that shared poly- 
morphism would be maintained for long by chance, so it is not surprising 
that cases of shared polymorphism are generally found in genes known to be 
under strong selection or in species that have recently diverged. 

When multiple copies of similar genes exist in the genome, they can 
exchange sequences through unequal recombination and gene conversion. 
Such exchanges can result in concerted evolution, a process whereby genes in 
a multigene family are very similar to one another within a species, even 
though the duplication events that gave rise to the family occurred far in the 
past. Not all multigene families undergo concerted evolution. A more com- 
mon finding is that many multigene families exist as groups of genes with 
related function that have diverged enough in sequence to escape gene con- 
version. In this case, new genes appear by duplication and old ones disap- 
pear by deletion, sometimes preceded by inactivating mutations that 
generate pseudogenes. This birth-and-death process gives rise to complex 
patterns of relationships among genes within gene families. 


Problems

1. Suppose that you have sequences of gene A and gene B from each of two
species. The fraction of sites that differ in gene A is 0.7 and the fraction of 
sites that differ in gene B is 0.5. Apply the Jukes-Cantor formula to
obtain the estimate of the number of substitutions per site for each gene. 
Which gene do you think would have a smaller estimate of variance of 
substitution rate? Why? 

2. Suppose you discover a community of deep sea creatures that have very 
unusual DNA that has not four bases but six. Adenine and thymine pair,
and guanine and cytosine pair just like most DNA, but there are also niti- 
dine and liondine, which also pair. You obtain sequences from two of these 
creatures and determine that 20% of the sites mismatch in aligned 
sequences. From this figure, estimate the number of substitutions per site 
that have occurred since the common ancestor of the two species. (Hint: 
You know that the number is higher than 0.20, because back mutations 
could have occurred. Derive an expression like the Jukes-Cantor formula.) 

3. The following is a small portion of the gene coding for 6-phosphogluconate
dehydrogenase in two natural isolates of E. coli.

[Sequences not reproduced.]

Infer the correct translational reading frame of the sequences and estimate:

a. the number of amino acid differences/site 
b. the number of nucleotide differences/site.

c. the number of nonsynonymous substitutions per nonsynonymous site
(regarding codon sites 1 and 2 as nonsynonymous).

d. the number of synonymous substitutions per synonymous site 
(regarding codon site 3 as synonymous). 

4. In the human immunodeficiency virus HIV, which causes acquired 
immune deficiency syndrome (AIDS), the rate of nucleotide evolution 
has been estimated at about 0.01 substitutions per synonymous site per 
year. Two viruses isolated in 1983 in Zaire and San Francisco differ in 
approximately one third of their synonymous sites. Estimate the year in 
which the viruses last shared a common ancestor (Data from Li et al., 

5. The data below give the proportion of nucleotide sites that differ in a
gene in four RNA viruses (Yokoyama et al. 1988). HIV1 and HIV2 are two 
rather distinct types of human immunodeficiency viruses, VISNA is a 
lentivirus, and MMLV is a mouse cancer-causing virus. Estimate the 
number of nucleotide substitutions per site using these data. What do the 
numbers imply about the evolutionary relationships among the viruses?

[Data table not reproduced.]

6. What inference would you make regarding the selective constraints on a 
region of DNA in which the rate of evolution was 5 × 10^-11 nucleotide
substitutions per site per year?

7. What might you infer about the evolutionary forces affecting a coding 
region in which the rate of amino acid replacement was greater than the 
rate of synonymous nucleotide substitution? 

8. Ribosomal RNA forms a complex secondary structure in which many
regions of the molecules are folded back and undergo base pairing with 
complementary nucleotide sequences elsewhere in the same molecule. 
What pattern of nucleotide sequence evolution might be expected in 
these paired regions? 

9. What is the largest value of d that makes sense in Equation 8.15 and what
does it mean? 

10. If the rate of nucleotide evolution along a lineage is 0.5% per million 
years, what is the rate of substitution per nucleotide per year? What is the 
total rate of divergence of two lineages? 

11. While analyzing the DNA sequences of two copies of a gene, you find 
that there are a total of 34 synonymous substitutions and 16 nonsynony- 
mous substitutions. Using the method of Nei and Gojobori, you find that 
there were 310 synonymous nucleotide sites and 633 nonsynonymous 
sites. If possible, estimate the rates of synonymous and nonsynonymous 
substitution, and interpret the result. 

12. If the effective size of a diploid population is N with respect to autosomal 
genes, what is it with respect to 

a. X-linked genes?

b. Y-linked genes?

c. mtDNA? 

13. Analysis of mtDNA in humpback whales (Baker et al., 1990, Nature 
344:238-240) has shown that not only do the Atlantic and Pacific popula- 
tions show differences, but there are clear geographic subpopulations 
within oceans despite the lack of geographic barriers. Such a pattern may 
be observed if either: (1) there were a low rate of migration and a low rate 
of mtDNA sequence divergence, or (2) a higher rate of migration with a 
higher rate of mtDNA sequence divergence. Can you distinguish these 
two possibilities? Can you separately estimate the rate of neutral muta- 
tion and the rate of migration in a subdivided population? 

14. Suppose the phylogeny of five species is ((A,B)R(C(D,E))), where R designates
the root. Can you ascribe the substitution events of the following
data uniquely to branches on this genealogy? Number the sites 1-10 and 
label the substitutions by site number on the tree. 

Species A TAG CTC ATC A 

Species B TAG CCG AGC A 

Species C TAC CCG ATT G 

Species D TAC CCT ATC A 

Species E TGC CCT ATC A 

15. For an ideal population of effective size N, the average time to loss of a
new mutation destined to be lost is 2 ln(2N), and the average time to fixation
of a new mutation destined to be fixed is 4N. For what values of N

a. Fixation time = 10 x loss time? 

b. Fixation time = 100 x loss time? 

16. For the model of gene conversion with gene identities given in Equation
8.18, what value of λ makes the organization of the gene family irrelevant
in the sense that f = c1 = c2? What is the common value in this case? (λ is
the probability that a particular member of the gene family becomes converted
in any one generation.)

17. For the model of gene conversion with gene identities given in Equation
8.18, what are the values of f and c1 = c2 when λ = μ? (The equations
assume 4Nμ ≪ 1.)

18. In a repetitive gene family being eliminated from the genome by deletion,
if the fraction of sequences present at time 0 that are still present at
time t equals exp(-ht), show that the half-life of the sequences equals
-ln(1/2)/h.

19. For a repetitive gene family eliminated as described in Problem 18, show 
that the average persistence of an element is 1/h.


Quantitative Genetics 

Artificial Selection • Heritability • Components of Genetic Variance •
Genotype × Environment Interaction • Threshold Traits •
Genetic Correlation • Evolutionary Quantitative Genetics • QTL Mapping

Many important problems in evolutionary biology begin with
observations of phenotypic variation. Darwin formulated his 
ideas about evolution by natural selection based on observations 
of phenotypic variation. He struggled for many years to explain the cause of 
the phenotypic variability, but he was unsuccessful at one level because he 
did not know about Mendelian genetics. Darwin did, however, appreciate 
the importance of the observation that offspring resemble their parents. Con- 
tinuously varying traits, like body size, are influenced by both genetic and 
environmental factors. Crossing experiments demonstrate that the genetic 
components of these traits are not determined by single genes because the 
offspring do not fall into discrete classes with simple Mendelian ratios.
Instead, what is observed is a general resemblance between parents and off- 
spring, suggesting that there is an underlying genetic basis to the trait, but 
that the genetic transmission is complex. 

A wealth of statistical tools have been developed for analyzing such poly- 
genic traits that do not show simple Mendelian transmission. These 
approaches allow not only a description of the genetic basis of observed phe- 
notypic distributions, but they also provide a means of predicting the distri- 
butions of phenotypes among offspring from observation of the parental 
phenotypes. Most polygenic traits are influenced by the environment to 
varying degrees, and they are often called multifactorial traits to emphasize 
their determination by multiple genetic and environmental factors. For 
example, variation in human weight is partly due to genetic differences 


among individuals and partly due to environmental factors such as exercise 
and level of nutrition. The study of polygenic inheritance goes beyond an 
oversimplified nature-versus-nurture dichotomy because it is concerned with 
specifying, in precise quantitative terms, the relative importance of nature, 
nurture, and their interactions, in accounting for variation in phenotype 
among individuals. Another compelling reason to study polygenic inheri- 
tance is that natural selection occurs at the level of the composite phenotype, 
and so fitness is a multifactorial trait. 

Since natural selection operates on phenotypes, there arises an immediate 
problem in understanding how phenotypic evolution is reflected in changes 
that occur at the molecular level. One of the great challenges facing popula- 
tion genetics is to unify the principles of molecular evolution with those gov- 
erning evolution at the phenotypic level. 


Multifactorial traits may be considered as resulting from the combined 
effects of many quantities, some genetic in origin and some environmental, 
and for this reason they are often called quantitative traits. The study of
quantitative traits constitutes quantitative genetics.

Three types of quantitative traits may be distinguished:

1. Traits for which there is a continuum of possible phenotypes are continu- 
ous traits; examples include height, weight, milk yield, and growth rate. 
The distinguishing feature of continuous traits is that the phenotype can 
take on any one of a continuous range of values. In theory, there are infi- 
nitely many possible phenotypes, among which discrimination is limited 
only by the precision of the instrument used for measurement. However, 
in practice, similar phenotypes are often grouped together for purposes 
of analysis. 

2. Traits for which the phenotype is expressed in discrete, integral classes
are meristic traits; examples include number of offspring or litter size,
number of ears on a stalk of corn, number of petals on a flower, and num-
ber of bristles on a fruit fly. The distinguishing feature of meristic traits is
that the phenotype of an individual is given by an integer that equals the 
number of elements of the trait that the individual displays. For example, 
a popular meristic trait used in experimental studies of quantitative 
genetics in Drosophila is the number of bristles that occur on the abdomi- 
nal segments or sternites. Normally there are 14 to 24 bristles per sternite. 
A male with 19 bristles on the fifth abdominal sternite therefore has a 
phenotype of 19. The distribution of numbers of abdominal bristles in a 
sample of Drosophila appears in Figure 9.1. When the number of possible
phenotypes of a meristic trait is large (as it is with abdominal bristle
number) then the line between continuous traits and meristic traits
becomes indistinct.

Figure 9.1 Number of bristles on the fifth abdominal sternite in males of a
strain of Drosophila melanogaster. The smooth curve is that of a normal distribu-
tion with mean 18.7 and standard deviation 2.1. (Data from T. Mackay.)
[Horizontal axis: number of bristles.]

3. The third category of quantitative traits consists of discrete traits, which 
are either present or absent in any one individual. In these cases, the 
multiple genetic and environmental factors combine to determine an 
underlying risk or liability toward the trait. Liability values are not 
directly observable. However, an individual that actually expresses the 
trait is assumed to have a liability value greater than some threshold or
triggering level. Traits of this type are called threshold traits, and exam-
ples in human genetics include diabetes and schizophrenia. With thresh-
old traits, studies of affected individuals and their relatives permit
inferences to be made about the underlying values of liability. These
methods are discussed later in this chapter.
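
The threshold idea can be made concrete with a minimal sketch. It assumes, purely for illustration, that liability follows a standard normal distribution (mean 0, standard deviation 1); the text has not yet specified a distribution, and the function names and the 1% prevalence are hypothetical.

```python
from statistics import NormalDist


def liability_threshold(prevalence: float) -> float:
    """Threshold T on an assumed standard-normal liability scale such that
    a fraction `prevalence` of individuals has liability greater than T."""
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    return NormalDist(mu=0.0, sigma=1.0).inv_cdf(1.0 - prevalence)


def expresses_trait(liability: float, threshold: float) -> bool:
    """An individual expresses the trait when its liability exceeds the threshold."""
    return liability > threshold


if __name__ == "__main__":
    t = liability_threshold(0.01)  # hypothetical trait affecting 1% of individuals
    print(f"threshold lies {t:.3f} standard deviations above the mean liability")
```

Under this assumption, a rarer trait corresponds to a higher threshold, which is why the observed frequency of affected individuals carries information about the unobservable liability scale.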

Quantitative traits are of utmost importance to plant and animal breeders, 
because agriculturally important characteristics such as yield of grain, egg 
production, milk production, efficiency of food utilization by domesticated 
animals, and meat quality are all quantitative traits. Even as modern methods 
of genetic engineering are applied to animal and plant improvement, quanti- 
tative genetics continues to play an important role because commercially 
desirable traits result from complex interactions among many genes. In addi- 
tion to being essential ingredients in plant and animal improvement pro- 
grams, the principles of quantitative genetics, appropriately modified and 
interpreted, can be applied to the analysis of quantitative traits in humans 
and natural populations of plants and animals. 


For Darwinian evolution to be possible, a necessary feature of the transmis- 
sion of traits is that offspring must tend to resemble their parents. Even 
before the rediscovery of Mendel's work, Francis Galton was collecting 
detailed statistical data on resemblance between parents and offspring 
(Chapter 2). We will demonstrate the central ideas of the transmission of 
quantitative traits, using some of the concepts that Galton developed. Then 
we will show how models of Mendelian inheritance can account for these 
features of hereditary transmission. Calculation of the degree of resemblance 
among relatives in terms of underlying Mendelian genetics was first provid- 
ed by Fisher (1918). Fisher's paper, notoriously difficult, was of great histor- 
ical importance to population genetics, because it provided the first demon- 
stration that multiple Mendelian genes could account for the observed pat- 
terns of transmission of multifactorial traits. 

Figure 9.2 shows a plot of the mean of male offspring for a quantitative 
trait (y values) against the phenotypic value of the father (x values), dis- 
played in the way Galton devised. The line is the best-fitting straight line, 
called the regression line, of offspring on parent. Regression is relevant to 
one of the primary aims in animal and plant breeding, namely to be able to 
improve attributes of the stock. An essential part of genetic improvement is to
be able to predict what sort of offspring would be obtained from a given pair 
of parents. For quantitative traits, prediction cannot be done exactly, but a 
statistical description of the most likely offspring can be obtained by the pro-
cedure of plotting the parent-offspring regression. For reasons that will
become clear in a moment, we are interested in the slope of the regression
line. The slope is most easily expressed in terms of the covariance of x and y,
defined as

$$\mathrm{Cov}(x,y) = \frac{1}{n}\sum (x - \bar{x})(y - \bar{y}) = \overline{xy} - \bar{x}\,\bar{y}$$

where the bar over a symbol means the average. This quantity is the sample
covariance of x and y. The slope of the line through a cluster of points having
the smallest summed squared distance to the points is the regression
coefficient: $b = \mathrm{Cov}(x,y)/\mathrm{Var}(x)$. A related quantity that also arises in
quantitative genetics is the product-moment correlation coefficient, often
simply referred to as the correlation: $r = \mathrm{Cov}(x,y)/\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}$.

Figure 9.2 Mean weight of male pupae of the flour beetle Tribolium castaneum,
against pupal weight of father (sire). Each point is the mean of about eight male
offspring. The regression coefficient of male offspring weight on sire's weight is
$b = 0.11$, and $h^2$ is estimated as $2b$. (Courtesy of F.D. Enfield.)
[Horizontal axis: pupal weight of sires (micrograms), 1800 to 2600.]

An important concept in statistics is the distinction between parameters
and estimators. Descriptors that are calculated from a set of data to describe
a sample (such as the sample mean and sample variance) are considered as
estimates of the parameters that determine the true distribution. The sample is
thought of as having been drawn from some perfect distribution (whose
parameters we can never know), and the sample statistics give us a best
guess at what that true distribution is. In statistics, the distinction is general-
ly made by unadorned Greek symbols for parameters and circumflexes for
estimates. The usual symbols are: $\mu$ for the parametric mean, $\sigma^2$ for the
variance, $\sigma_{xy}$ for the covariance of x and y, and $\rho$ for the correlation. Using the
circumflex notation for estimates, $\hat{\mu}_x$ denotes the sample mean of x, so that
$\hat{\mu}_x = \bar{x}$. Similarly, $\hat{\sigma}_x^2 = \mathrm{Var}(x)$ and $\hat{\sigma}_{xy} = \mathrm{Cov}(x,y)$ are the sample estimates of
the variance of x and the covariance of x and y. When describing models of
quantitative genetics, it is the true distributions that are of interest, and so the
parameters are used. When describing the results of an experiment, it is more
appropriate to use the notation for estimates. The covariance and the correla-
tion coefficient are convenient measures of the degree of association between
x and y. If x and y are independent, then $\sigma_{xy}$ and $\rho$ are both zero. Since the
covariance between any two variables measures their degree of association,
the covariance may be positive or negative. Positive covariance means that
values of x and y tend to increase or decrease together; negative covariance
means that, as one variable increases, the other tends to decrease. The limit-
ing values of the covariance are $-\sigma_x\sigma_y$ on the negative side, and $\sigma_x\sigma_y$ on the
positive. The limits are achieved only when the variables demonstrate a per-
fect linear relationship with each other.
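
The sample covariance, regression coefficient, and correlation can be computed directly from their definitions. The sketch below is only an illustration: the function names and the five data pairs are hypothetical and are not taken from any experiment discussed in this chapter.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Sample covariance: the mean of (x - xbar)(y - ybar), dividing by n."""
    xbar, ybar = mean(x), mean(y)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / len(x)


def variance(x):
    """Sample variance, the covariance of a variable with itself."""
    return covariance(x, x)


def regression_coefficient(x, y):
    """Slope b of the regression of y on x: Cov(x, y) / Var(x)."""
    return covariance(x, y) / variance(x)


def correlation(x, y):
    """Product-moment correlation: Cov(x, y) / sqrt(Var(x) Var(y))."""
    return covariance(x, y) / sqrt(variance(x) * variance(y))


if __name__ == "__main__":
    # Hypothetical parent (x) and offspring (y) values, for illustration only.
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.0, 1.0, 4.0, 3.0, 5.0]
    print(covariance(x, y), regression_coefficient(x, y), correlation(x, y))
```

Multiplying every y value by a constant rescales the covariance and the regression coefficient by that constant but leaves the correlation unchanged, and the absolute value of the covariance never exceeds the product of the two standard deviations.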

Returning now to Figure 9.2, if Cov(x,y) represents the covariance
between phenotypic values of fathers (sires) and those of their male off-
spring, and Var(x) represents the variance of phenotypic values of the fathers,
then the slope of the regression line is equal to the regression coefficient,
Cov(x,y)/Var(x), which can be seen as follows. Suppose that the equation of
the line is represented as

$$y = c + bx$$

where c and b are constants, b being the slope. Taking means of both sides gives

$$\bar{y} = c + b\bar{x}$$

Subtracting the second equation from the first yields

$$y - \bar{y} = (c + bx) - (c + b\bar{x}) = b(x - \bar{x})$$

Now multiply through by $x - \bar{x}$ to obtain

$$(x - \bar{x})(y - \bar{y}) = b(x - \bar{x})^2$$

Taking means of both sides produces

$$\mathrm{Cov}(x,y) = b\,\mathrm{Var}(x)$$

In other words, the slope b of the regression line equals

$$b = \mathrm{Cov}(x,y)/\mathrm{Var}(x)$$

As noted, the slope is called the regression coefficient of offspring on one
parent.

A graphical interpretation of regression is illustrated in Figure 9.3, which
shows the distribution, in two dimensions, of the variables x and y. The vari-
ables may represent, for example, the phenotypic values of parents (x) and
offspring (y). When there is no association between x and y, the distribution
is a random scatter of points, and any line through the points fits equally
badly. Figure 9.3 shows the appearance of the scatter of points for different
values of association between the two variables. Note that, while each para-
meter measures an aspect of association between x and y, the covariance, the
regression coefficient, and the correlation coefficient are different things. For
example, the covariance and the regression coefficient are unbounded,
whereas the correlation coefficient must be between -1 and 1.

Figure 9.3 Plots of random scatters of points having the same variance on the
x axis but a range of covariances. With zero covariance (top), the regression coef-
ficient is zero. A stronger linear trend results in a higher regression coefficient.
[Panel labels: b = 0.2, b = 0.6, b = 0.9.]

Two extreme examples may help clarify parent-offspring regression. At 
one extreme, if there were no genetic contribution to the trait, then the scat- 
tergram might appear as a random scatter as in the top panel of Figure 9.3 
with no tendency to follow a line. In such a case, knowing the phenotype of 
the parents would not help to predict that of the offspring, because there 
would be no parent-offspring resemblance. On the other hand, even with no 
genetic variation, the points might nevertheless show a substantial tendency 
to follow a line. To see why this is so, consider families living in different 
environments. In favorable environments with plenty of food and resources, 
parents and offspring might all be big and strong, while in unfavorable envi- 
ronments, parents and offspring might be small and sickly. A parent-off- 
spring plot would show that big strong parents have big strong offspring, 
while small sickly parents have small sickly offspring, even though there is 
absolutely no genetic basis for the trait. The tendency of points to follow a
line in a parent-offspring scattergram tells us nothing about the genetic basis
of the trait, unless we are willing to make some claims (which hopefully can
be tested experimentally) about the environmental covariance (the tendency
of parents and offspring to resemble one another due to shared environ-
ments). Only if there is no environmental covariance will the parent-offspring
regression indicate a degree of genetic influence on the resemblance. The 
possibility of environmental covariance is absolutely critical in human quan- 
titative genetics, where the influence of shared environments can be very 
subtle and very strong. 
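
The effect of environmental covariance can be illustrated with a small simulation. This is a sketch under invented assumptions, not an analysis of real data: there is no genetic variation at all, each family shares one environmental deviation (standard deviation 1), and each individual receives a private deviation (standard deviation 0.5). The function names, the baseline value of 100, and the seed are hypothetical.

```python
import random


def simulate_shared_environment(n_families=2000, seed=1):
    """Parent and offspring phenotypes with NO genetic variation.

    Resemblance arises only because both members of a family share the
    same environmental deviation (an assumption of this sketch).
    """
    rng = random.Random(seed)
    parents, offspring = [], []
    for _ in range(n_families):
        family_env = rng.gauss(0.0, 1.0)  # environment shared within a family
        parents.append(100.0 + family_env + rng.gauss(0.0, 0.5))
        offspring.append(100.0 + family_env + rng.gauss(0.0, 0.5))
    return parents, offspring


def regression_slope(x, y):
    """Regression coefficient of y on x: Cov(x, y) / Var(x)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    cov = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / n
    var = sum((a - xbar) ** 2 for a in x) / n
    return cov / var


if __name__ == "__main__":
    p, o = simulate_shared_environment()
    print(f"offspring-on-parent slope with zero genetic variance: {regression_slope(p, o):.2f}")
```

With these variances the expected slope is 1/(1 + 0.25) = 0.8, so a naive reading of the regression would suggest a strongly heritable trait even though every individual is genetically identical in the simulation.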

Assuming now that the environmental covariance is zero, the regression 
coefficient b of offspring on one parent can be calculated for any 
random-mating population, and it indicates the degree to which the variance 
in the trait is determined by genetic variation. It is for this reason that the 
regression coefficient is related to an important quantity in quantitative 
genetics called heritability. There are two types of heritability that will be 
distinguished shortly, but for now, we note that the "narrow-sense" heri-
tability ($h^2$) can be estimated from the relationship

$$b = \tfrac{1}{2}h^2 \qquad (9.7)$$


The factor $\tfrac{1}{2}$ occurs in Equation 9.7 because the regression involves only a sin-
gle parent (the father, in the case of Figure 9.2), and only half of the genes
from any one parent are passed on to the offspring. In Figure 9.2, $b = 0.11$,
so $h^2 = 0.22$. Notice the considerable scatter among the points in the figure,
which represents data from 32 families. Because this sort of scatter is typical,
heritability estimates tend to be quite imprecise unless based on data from
several hundred families. Note, however, that even with an enormous sample,
there would be no less scatter to the points; we would merely have a more
accurate measure of how much scatter there is. One further point about
Figure 9.2: in organisms such as mammals, the regression is better performed
on the father's phenotype, rather than on the mother's, in order to avoid
potential bias in the estimate of heritability caused by such maternal effects
as intrauterine environment. In organisms where nurturing does not impart
significant maternal effects, scattergrams can be constructed with the x axis
being the average of the two parents (the midparent) and the y axis the off-
spring phenotypes. From this sort of plot the regression coefficient is equal to
the heritability: in symbols, when the x axis is the midparent, $b = h^2$.
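
The two conversions from regression coefficient to narrow-sense heritability can be written as a short helper. This is a minimal sketch; the function names and the `parent` argument are hypothetical conveniences, and the conversions hold only under the assumptions stated above (no environmental covariance and random mating).

```python
def midparent(father: float, mother: float) -> float:
    """Midparent value: the average of the two parental phenotypes."""
    return (father + mother) / 2.0


def heritability_from_regression(b: float, parent: str = "single") -> float:
    """Convert an offspring-on-parent regression coefficient b into h^2.

    parent="single":    x axis is one parent, so b = h^2 / 2 and h^2 = 2b.
    parent="midparent": x axis is the midparent, so b = h^2.
    """
    if parent == "single":
        return 2.0 * b
    if parent == "midparent":
        return b
    raise ValueError("parent must be 'single' or 'midparent'")


if __name__ == "__main__":
    # Figure 9.2: regression of male offspring on sire, b = 0.11.
    print(heritability_from_regression(0.11, parent="single"))
```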

PROBLEM 9.1 This example of calculating $h^2$ from parent-offspring
regression uses data from Cook (1965), who studied shell breadth in
119 sibships of the snail Arianta arbustorum. For computational conve-
nience, the data have been grouped into six categories. Estimate the
heritability of shell breadth from these data.

Number of sibships    Midparent value (mm)    Offspring mean (mm)

ANSWER Letting x refer to the midparent value and y refer to the
offspring mean, then $\bar{x} = 20.2626$, $\bar{y} = 20.1786$, $\sum x^2 = 49{,}823.4375$,
$\sum y^2 = 49{,}267.1875$, $\hat{\sigma}_{xy} \approx 5.1$, $\hat{\sigma}_x^2 \approx 8.11$, and $h^2 = b = 0.63$. (In actual
practice we might not want to group the data into categories, because
there is some loss of accuracy from grouping. The regression coeffi-
cient for the ungrouped data is $b = 0.70$. In addition, it should be
noted that there is substantial assortative mating for shell breadth,
and so the heritability estimate is artificially large.)

To this point we have shown that heritability can be used to measure the 
degree of resemblance between parents and offspring. Although the defini- 
tion of heritability in terms of the regression coefficient between midparents 
and offspring is reasonable, heritability defined in this manner is merely a 
descriptive, empirical quantity because it makes no assumptions about 

genetics. In the next section we show how heritability in this purely statisti- 
cal sense can be used to predict the result of artificial selection. 


The deliberate choice of a select group of individuals to be used for breeding 
constitutes artificial selection. The most common type of artificial selection 
is directional selection, in which phenotypically superior animals or plants 
are chosen for breeding. Although artificial selection has been practiced suc- 
cessfully for thousands of years (for example, in the body size of domesti- 
cated dogs), only during this century have the genetic principles underlying 
its successes become clear. Understanding the genetic principles of artificial 
selection permits prediction of the rapidity and amount by which a popula- 
tion can be altered through artificial selection in any particular generation or 
small number of generations. The theory of artificial selection is also strong- 
ly motivated by the idea that natural selection may operate in a similar way. 
For example, if only those individuals with greater than a certain amount of 
body fat survive, or only those individuals with less than a critical rate of 
evaporative water loss survive, then natural selection acts on the distribution 
of phenotypes in much the same way that breeders select characters of 
agricultural importance. 

Artificial selection in outcrossing, genetically heterogeneous populations 
is usually successful in that the mean phenotype of the population changes 
over generations in the direction of selection (provided the population has 
not previously been subjected to long-term artificial selection for the trait in 
question). In experimental animals, the mean of almost any quantitative trait 
can be altered in whatever direction desired by artificial selection. For exam-
ple, in Drosophila, body size, wing size, bristle number, growth rate, egg pro-
duction, insecticide resistance, and many other traits can be increased or
decreased by selection. In domesticated animals and plants, birth weight, 
growth rate, milk production, egg production, grain yield, and countless 
other traits respond to selection. Figure 9.4 shows the results of a long-term 
selection program involving oil content in corn. Amazingly, the line selected 
for high oil content is still responding after more than 90 generations (Dudley 
and Lambert 1992). 

The general success of artificial selection in outcrossing species indicates 
that a wealth of genetic variation affecting quantitative traits exists. On the 
other hand, in a genetically uniform population, the mean phenotype of the 
population cannot usually be changed through artificial selection, because 
genetic variation is required for progress under artificial selection. For exam-
ple, in experiments with the Princess bean, Johannsen (1909) found that arti-
ficial selection consistently resulted in failure when practiced within
essentially homozygous lines. He obtained this result because, in genetically
homozygous populations, the only source of genetic variation comes from
new mutations. In contrast, since genetically variable populations usually
respond to artificial selection, and genetically uniform populations do not
respond, the response to artificial selection might be used as a measure of the
extent of genetic variation in the trait. This notion of selection response
reflecting genetic variation will be formalized in the next section.

Figure 9.4 Results of a famous long-term experiment selecting for high and
low oil content in corn seeds. Begun in 1896, the experiment has the longest
duration of any on record and still continues at the University of Illinois. Note
the steady, linear rise in oil content shown by the upper curve. The lower curve
started on a roughly linear path and continued so for about ten generations, but
then the response tapered off, presumably because zero percent oil is an
absolute lower limit for the trait. (After Dudley and Lambert 1992.)

Prediction Equation for Individual Selection

When individuals are selected for breeding based solely on their own indi-
vidual phenotypic values, the type of artificial selection is called individual
selection. Figure 9.5 illustrates a variety of individual selection called trun-
cation selection. The curve in panel A represents the normal distribution of
a quantitative trait in a population, and the shaded part of the distribution to
the right of the phenotypic value denoted T indicates those individuals
selected for breeding. The value T is called the truncation point. The mean
phenotype in the entire population is denoted $\mu$, and that of the selected par-
ents is denoted $\mu_S$. When the selected parents are mated at random, their off-
spring have the phenotypic distribution shown in panel B, where the mean
phenotype is denoted $\mu'$.

Figure 9.5 Diagram of truncation selection. (A) Distribution of phenotypes in
the parental population, mean $\mu$. Individuals with phenotypes above the trun-
cation point (T) are saved for breeding the next generation. The selected parents
are denoted by the shading and their mean phenotype by $\mu_S$. (B) The mean of
the distribution of phenotypes in the progeny is denoted $\mu'$. Note that $\mu'$ is
greater than $\mu$ but less than $\mu_S$. The quantity S is called the selection differential,
and R is called the response to selection.
[Panel annotation: $S = \mu_S - \mu$.]

An example of truncation selection for seed weight in edible beans is
shown in Figure 9.6. In this example, T = 650 mg, $\mu$ = 403.5 mg, $\mu_S$ = 691.7 mg,
and $\mu'$ = 609.1 mg. In this case, as is typical of truncation selection, the off-
spring mean $\mu'$ is greater than the previous population mean $\mu$ but less than
the parental mean $\mu_S$. The reason $\mu'$ is greater than $\mu$ is that some of the
selected parents have favorable genotypes and therefore pass favorable genes
on to their offspring. At the same time, $\mu'$ is generally less than $\mu_S$ for two
reasons:

1. Because some of the selected parents do not have favorable genotypes;
rather, their exceptional phenotypes result from chance exposure to
exceptionally favorable environments.

2. Because alleles, not genotypes, are transmitted to the offspring, and
exceptionally favorable genotypes are disrupted by Mendelian segrega-
tion and recombination.

Figure 9.6 Truncation selection experiment for seed weight in edible beans of
the genus Phaseolus, laid out as in Figure 9.5. The truncation point (T) is 650 mg.
The selection differential S is the difference in means between the selected par-
ents and the whole population. The response R is the difference in means
between the progeny generation and the entire population in the previous gener-
ation. The quantity R/S is the realized heritability. (Data from Johannsen 1903.)
[Horizontal axes: weight of seed (milligrams). Panel annotations:
$S = \mu_S - \mu = 691.7 - 403.5 = 288.2$; $R = \mu' - \mu = 609.1 - 403.5 = 205.6$.]
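
The ordering of the three means can be reproduced in a small simulation. The model is an assumption made for this sketch and is not taken from the text: each phenotype is the sum of a normally distributed breeding value and an independent normal environmental deviation, total phenotypic variance is 1, gene effects are purely additive, and offspring breeding values are the midparent breeding value plus a segregation deviation. It therefore captures the first reason above (environmental deviations are not inherited) more directly than the second. All names and parameter values are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def simulate_truncation_selection(n=20000, h2=0.5, fraction_saved=0.2, seed=1):
    """Return (mu, mu_s, mu_prime) for one generation of truncation selection
    under the assumed additive model described in the lead-in."""
    rng = random.Random(seed)
    sd_a = h2 ** 0.5            # standard deviation of breeding values
    sd_e = (1.0 - h2) ** 0.5    # standard deviation of environmental deviations
    sd_seg = (h2 / 2.0) ** 0.5  # standard deviation of segregation within families

    # Parental generation stored as (breeding value, phenotype) pairs.
    parents = []
    for _ in range(n):
        a = rng.gauss(0.0, sd_a)
        parents.append((a, a + rng.gauss(0.0, sd_e)))
    mu = mean([phenotype for _, phenotype in parents])

    # Truncation: save the top fraction of individuals ranked by phenotype.
    ranked = sorted(parents, key=lambda ind: ind[1], reverse=True)
    saved = ranked[: int(n * fraction_saved)]
    mu_s = mean([phenotype for _, phenotype in saved])

    # Random mating among saved parents; each offspring draws a fresh environment.
    offspring = []
    for _ in range(n):
        sire, dam = rng.choice(saved), rng.choice(saved)
        a = (sire[0] + dam[0]) / 2.0 + rng.gauss(0.0, sd_seg)
        offspring.append(a + rng.gauss(0.0, sd_e))
    mu_prime = mean(offspring)
    return mu, mu_s, mu_prime


if __name__ == "__main__":
    mu, mu_s, mu_prime = simulate_truncation_selection()
    print(f"mu = {mu:.3f}, mu_s = {mu_s:.3f}, mu' = {mu_prime:.3f}")
```

In this model the offspring mean falls between the original mean and the mean of the saved parents, because only the breeding-value part of the parents' superiority is passed on.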

The difference in mean phenotype between the selected parents and the
entire parental population is the selection differential and is designated S.
In symbols,

$$S = \mu_S - \mu \qquad (9.8)$$

The difference in mean phenotype between the progeny generation and the
previous generation is the response to selection and is designated R.

$$R = \mu' - \mu \qquad (9.9)$$

In quantitative genetics, any equation that defines the relationship
between the selection differential S and the response to selection R is known
as a prediction equation. Since selection can be applied to a population in
many different ways (others will be discussed later in this chapter), the pre-
diction equation may differ corresponding to the different modes of selec-
tion. A general prediction equation that applies to many forms of selection,
including truncation selection (the type of selection illustrated in Figure 9.5),
is

$$R = h^2 S \qquad (9.10)$$

where $h^2$ is the realized heritability. Later in this chapter, we will show that
the realized heritability is identical to the narrow-sense heritability defined
by regression, provided the phenotypes and the magnitudes of genetic
effects follow a bell-shaped Gaussian distribution. These assumptions are
necessary in order to apply regression to the problem. This equivalence
emphasizes again that heritability can be understood at several different lev-
els. Equation 9.10 implies that the realized heritability of a trait can be inter-
preted as a mere description of what happens when artificial selection is
practiced. In Figure 9.6, for example, $S = 288.2$ and $R = 205.6$, so $h^2 = R/S =
205.6/288.2 = 71.3\%$. When estimated like this from empirical data, $h^2$ is the
realized heritability, and it simply summarizes the observed result.
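
The arithmetic for Figure 9.6 can be checked in a few lines. The function names are hypothetical; the three means are the values quoted in the text for the bean experiment.

```python
def selection_differential(mu: float, mu_s: float) -> float:
    """S: mean of the selected parents minus the mean of the whole population."""
    return mu_s - mu


def selection_response(mu: float, mu_prime: float) -> float:
    """R: mean of the progeny minus the mean of the previous generation."""
    return mu_prime - mu


def realized_heritability(mu: float, mu_s: float, mu_prime: float) -> float:
    """Realized heritability, R / S."""
    return selection_response(mu, mu_prime) / selection_differential(mu, mu_s)


def predicted_response(h2: float, s: float) -> float:
    """Prediction equation R = h^2 S."""
    return h2 * s


if __name__ == "__main__":
    # Seed weights (mg) from Figure 9.6.
    mu, mu_s, mu_prime = 403.5, 691.7, 609.1
    h2 = realized_heritability(mu, mu_s, mu_prime)
    print(f"S = {selection_differential(mu, mu_s):.1f} mg")
    print(f"R = {selection_response(mu, mu_prime):.1f} mg")
    print(f"realized h^2 = {h2:.3f}")
```

Used in the other direction, `predicted_response` gives the expected change in the mean for a planned selection differential, provided an estimate of $h^2$ is already available.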

PROBLEM 9.2 Below are data on the number of sternital bristles in
samples from two consecutive generations $G_1$ and $G_2$ of an experi-
ment in directional selection for increased bristle number. In the $G_1$
generation, individuals with 22 or more bristles (enclosed in brackets)
were mated together at random to form the $G_2$ generation. Estimate
the realized heritability of the number of sternital bristles in this exper-
iment. (Data kindly provided by Trudy Mackay. In order to make the
sexes comparable, the value of 2 has been added to the bristle number
in males.)

ANSWER The means are $\mu = 2220/115 = 19.3$, $\mu_S = 22.7$, and $\mu' = 20.1$.
The selection differential $S = 22.7 - 19.3 = 3.4$ (Equation 9.8) and the
response $R = 20.1 - 19.3 = 0.8$ (Equation 9.9). The realized heritability
estimated from Equation 9.10 is $h^2 = 0.8/3.4 = 0.235$.

Data from experiments by Mackay (1985) demonstrate the potential sig- 
nificance of new mutations in quantitative genetics. The base population on 
which selection was performed was created by a cross that mobilizes the 
transposable element P that results in new P-element insertions in the 
germline and a syndrome of partial infertility and other reproductive abnor- 
malities known as hybrid dysgenesis. As a control, a genetically identical 
base population was formed by the reciprocal cross, in which the P element 
is not mobilized and hybrid dysgenesis does not occur. In the dysgenic cross, 
the realized heritability in abdominal bristle number was increased by 40% 
as compared with the nondysgenic control. More strikingly, the phenotypic 
variance of bristle number in the selected dysgenic lines increased by a factor 
of three over the course of eight generations. These results demonstrate that 
the genetic variation affecting quantitative traits may even include insertions 
of transposable elements. On the other hand, other comparable experiments 
using hybrid dysgenesis have not given such dramatic results. 

Selection Limits 

Progress under artificial selection does not continue forever. Any population 
must eventually reach a selection limit, or plateau, after which it no longer 
responds to selection. One of the reasons that a population eventually reach- 
es a plateau is exhaustion of genetic variance, such that all alleles affecting 
the selected trait have become fixed, lost, or are otherwise unavailable for
selection. With no genetic variance, no progress under individual selection 

can be achieved. However, many experimental populations that have 
reached a selection limit readily respond to reverse selection (selection in the 
reverse direction of that originally applied), so genetic variance affecting the 
trait is still present. Indeed, in such populations, the phenotype may change 
in the direction of its original value if continuing artificial selection is simply 
suspended (relaxed selection). The consequences of relaxed selection for one 
example in Drosophila are illustrated in Figure 9.7. 

One frequent reason for the occurrence of selection limits in populations 
with considerable genetic variation is that artificial selection is opposed by nat- 
ural selection. In mice, for example, response to selection for small body size 
ultimately ceases because small animals are less fertile than larger ones, and the 
smallest animals are sterile (Falconer and Mackay 1996). Selection for small 
body size gradually becomes less effective due to the opposing effects of natur- 
al selection until, eventually, no further progress is possible. When selection is 
relaxed, the natural selection is unopposed and results in a retrogression in the 
artificially selected trait. Some backward slippage with relaxed selection also 
results from diminution in the linkage disequilibrium that usually builds up 
during the course of long-term artificial selection. If natural selection opposes 
the artificial selection, then when artificial selection is relaxed, natural selection 
results in at least a partial return to the initial phenotypic mean. 

Figure 9.7 Response to selection for wind tunnel flight speed in Drosophila
melanogaster. One line was maintained without selection for 30 generations start-
ing at generation 65, and another was maintained without selection for 10 gen-
erations starting at generation 85 (triangles). In these examples, the flight
performance did not degrade after selection was relaxed. Apparently the selection
response occurred with little correlated response on fitness. (After Weber 1996.)
[Horizontal axis: generation.]

TABLE 9.1

Character selected      Direction of selection   Total response(a)   Half-life of response(b)
Weight (in strain N)                              3.4 $\sigma_p$
                                                  5.6 $\sigma_p$
Weight (in strain Q)                              3.9 $\sigma_p$
                                                  3.6 $\sigma_p$
Growth rate                                       2.0 $\sigma_p$
                                                  4.5 $\sigma_p$
Litter size                                       1.2 $\sigma_p$
                                                  0.5 $\sigma_p$

Source: From Falconer 1977.

(a) Total response is expressed as a multiple of the initial phenotypic standard deviation, $\sigma_p$.
(b) Half-life of response is the number of generations taken to progress halfway to the selection
limit; here the half-life is expressed in multiples of effective population number (N).

In most genetically heterogeneous populations, artificial selection can 
change the phenotype well beyond the range of variation found in the origi- 
nal population. Pertinent data for populations of mice are presented in Table 
9.1. As can be seen, a total selection response of three to five times the origi- 
nal phenotypic standard deviation is not unusual, and for selection to change 
a population of effective size N halfway to its selection limit typically
requires about (1/2)N generations.

In some cases the total response to artificial selection is very large. For 
example, in a long-term selection experiment for pupal weight in Tribolium,
in which the base population consisted of the progeny of a cross
between two inbred lines, 100 generations of selection resulted in a popula- 
tion in which the mean pupal weight in the selected population was 17 
standard deviation units greater than the mean in the base population 
(Enfield 1980). The ability to select a population in which virtually every 
phenotype is greater than the maximum in the original population strikes 
many students as paradoxical. It does seem plausible to argue that, if all of 
the alleles eventually selected are already present in the original popula- 
tion, then all possible favorable genotypes should be present also, though 
perhaps at low frequency. The fallacy in the argument is that real popula- 
tions subjected to artificial selection are actually small in size, consisting of 
at most a few hundred organisms. Therefore, if the favored alleles are rare, 
then the frequency of the favored genotypes may be so small that the
expected number of such genotypes will be much smaller than one, and so 
the superior genotypes, while theoretically possible, do not actually exist in 
the original population. 
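The arithmetic behind this argument can be sketched as follows. This is only an illustration under assumptions that are not in the text: n unlinked genes in linkage equilibrium, random mating, and the same favored-allele frequency p at every gene. The function name and the numbers (p = 0.25, up to 10 genes, 500 organisms) are hypothetical.

```python
def expected_best_genotypes(p, n_genes, pop_size):
    """Expected number of individuals homozygous for the favored allele at every gene.

    Assumes Hardy-Weinberg proportions at each gene and independence across
    genes (hypothetical simplifications made for this sketch).
    """
    freq_best = (p ** 2) ** n_genes   # frequency of the all-favored homozygote
    return pop_size * freq_best


for n_genes in (1, 2, 5, 10):
    print(n_genes, expected_best_genotypes(0.25, n_genes, 500))
```

With only a handful of genes the expected count already falls far below one, which is the sense in which the superior genotype is possible in principle but absent from the base population.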


Some traits consistently fail to respond to artificial selection, suggesting 
a lack of suitable genetic variation. Bilateral symmetry is an example of a 
trait that has not been amenable to change by artificial selection. The failure 
of Maynard-Smith and Sondhi (1961) to create bilateral asymmetry in 
Drosophila by selecting for an excess of dorsal bristles on the left side is typi- 
cal. The apparent lack of genetic variation determining bilateral asymmetry 
is of interest in regard to embryonic development for it implies that the 
genetic control of development of symmetrical structures specifies patterns 
that are common to the left and the right sides of the body. That is, rather 
than left-bristle genes and right-bristle genes, there appear to be generic 
bristle genes whose spatial expression is determined symmetrically. Of 
course asymmetrical structures do exist (such as the vertebrate heart) and 
recently inroads have been made in understanding the molecular genetic 
basis for this asymmetry (Isaac et al. 1997). Genes that affect left-right
asymmetry do not do so in a continuous manner; rather, they either successfully
establish the asymmetry or they do not; absence of symmetry is fatal. 

Not all traits with heritable variation obey the prediction equation and 
show a simple linear change in the mean. Sometimes a trait responds to 
directional selection for a few generations, then ceases to respond, but later 
responds again as selection is continued. One possible mechanism for this 
stop-and-start response is that the population at a plateau is in linkage dise- 
quilibrium, and it takes time for recombination to break up the allelic associ- 
ations and release the latent genetic variation. This phenomenon was 
observed in a long-term study of the quantitative genetics of wing veins in 
Drosophila (Scharloo 1987). In this case a bimodal phenotypic distribution 
was also generated during selection (Figure 9.8), which was proposed to 
reflect a nonlinear mapping from genetic and environmental factors to the 
determination of phenotype. 

As we have seen, heritability can be interpreted in purely statistical terms 
with no genetic content. However, if we postulate that there are Mendelian 
genes underlying the phenotypes, then the genetic underpinning allows us 
to do more than merely describe statistical relations among individuals. By 
bringing Mendelian genetics into the picture, we will see why the response to 
any kind of artificial selection is determined by the magnitude of the
heritability. In particular, the genetic basis of response to artificial selection
comes from changes in gene frequencies and sometimes also from changes in linkage
disequilibrium.


When $h^2$ is interpreted as realized heritability, then Equation 9.10 is hardly a
"prediction equation" inasmuch as it merely describes what has already happened
in one generation of selection. Of course, the equation could be used to predict
the result of the next generation of selection, but artificial selection is
impossible in many natural populations and is time consuming and expensive in
many domesticated plants and animals. It would therefore be useful if one could
estimate heritability without actually performing any artificial selection. If
the heritability $h^2$ could be estimated in such a manner, then Equation 9.10
would be a true prediction equation in the sense that the response R could be
predicted for any selection differential S, based on the estimated value of
$h^2$. Such an estimate of $h^2$ is indeed possible, but it involves an
understanding of heritability at a level that includes the underlying genetic
basis of quantitative traits.

[Figure 9.8: plots not reproduced; horizontal axis: Length of fourth wing vein as
percentage of third wing vein.]

Figure 9.8 Frequency distributions in females (left) and males (right) of a line
of Drosophila melanogaster selected for fourth wing vein length. The broken lines
represent selection for a short vein, and solid lines represent selection for a
long vein. In the line selected for long veins, both sexes displayed a bimodal
frequency distribution when the relative vein length was approximately 60-80%.
(From Scharloo 1987.)

An understanding of the genetics behind Equation 9.10 requires three 
items: (1) a concept of how alternative alleles of a gene affect a quantitative
trait; (2) a determination of how selection changes the allele frequencies; and 
(3) a calculation of how much the mean of the trait increases as a result of the 
change in allele frequency. Some detail is required to establish these three 
items, but the detail is necessary in order to understand the genetic meaning 
of heritability. 

Nilsson-Ehle (1909) was the first to show that a trait with a nearly continuous
distribution of phenotypes could result from the joint effects of several
genes. The trait of interest is the intensity of red pigment in the glume of
wheat Triticum vulgare, which Nilsson-Ehle found to result from three
unlinked genes, each with two alleles. The situation is exceptionally simple
for a quantitative trait because the environment has a negligible effect on
phenotype, because the alleles of each gene are additive (i.e., heterozygotes
have a phenotype that is exactly intermediate between homozygous phenotypes),
and because the genetic effects are also additive across genes (i.e., the total
genetic effect of any three-locus genotype is just the sum of the separate
effects of each gene). To simplify matters, consider just two of the genes, and
let their alleles be denoted (A, a) and (B, b). With additivity within and across
genes, we may assume that the genotype aabb has a color score of 0 (white) and
that each A or B allele in the genotype contributes one unit of red pigment.
Figure 9.9A shows the nine possible two-gene genotypes, their frequencies with
random mating when the allele frequencies of A and B are both 1/2, and the color
score of each genotype assuming additivity. The mean color score of the
population is 2. Indeed, when the allele frequencies of A and B are both p, then
the mean of a population with random mating can be shown to equal 4p. To
connect this trait with the prediction equation, Equation 9.10, suppose that the
two lowest phenotypic classes (i.e., 0 and 1) are selected as parents of the next
generation. To calculate $\mu_S$, $\mu'$, S, R, and $h^2 = R/S$, we first find the
allele frequency of A and B among the selected parents; then we use the
mean = 4p formula to obtain the mean of the offspring with random mating.

In this example, $\mu$ = 4(1/2) = 2 is given. The selected parents consist of
genotypes Aabb, aaBb, and aabb with respective frequencies 2/5, 2/5, and 1/5, and
the mean of parents = $\mu_S$ = (2/5)(1) + (2/5)(1) + (1/5)(0) = 4/5. The allele
frequency of A and B among parents = (1/2)(2/5) = 1/5, and therefore the mean
among offspring is $\mu'$ = 4(1/5) = 4/5. Then S = (4/5) - 2 = -6/5 and
R = (4/5) - 2 = -6/5, so $h^2$ = R/S = 1.0. As demonstrated in the next
paragraph, this high heritability is due to the additivity within and across
genes and not merely to the fact that environmental effects are negligible.

[Figure 9.9: two-panel diagram not reproduced; panel (B) is labeled Dominant.]

Figure 9.9 Frequencies of two-locus genotypes (outside circles) and respective
phenotypes (within circles) in a population with allele frequency 1/2 for each
locus. Panel A illustrates the case of additivity of effects at each locus and across
loci. In panel B, A and B are each dominant to a and b respectively, but the
effects of the two loci are additive.

Figure 9.9B refers to a hypothetical situation in which the A and B alleles
are dominant but still additive across genes. Thus, genotypes AA, Aa, BB,
and Bb each add one unit of red pigment to the phenotype. In this case, it
can be shown that the mean of a random-mating population with allele
frequencies of A and B both equal to p is given by 2p(1 + q), where q = 1 - p. If
the two lowest phenotypic classes (i.e., 0 and 1) are selected as parents of the
next generation, then the mean of parents = $\mu_S$ = (1/7)(1) + (2/7)(1) +
(1/7)(1) + (2/7)(1) + (1/7)(0) = 6/7. The allele frequency of A and B among
parents is p = (1/7) + (1/2)(2/7) = 2/7, and the mean of the offspring is
therefore $\mu'$ = 2(2/7)[1 + (5/7)] = 48/49. Thus, S = (6/7) - (3/2) = -9/14 and
R = (48/49) - (3/2) = -51/98. In the case where A and B are dominant, therefore,
$h^2$ = R/S = 51/63 = 0.81. Although environmental effects on seed color are still
negligible in the dominance case, the heritability has become less than 1.0.
This perhaps surprising result occurs because certain genetic effects (such as
those resulting from dominance or, in other examples, nonadditivity across
genes) are not useful in changing a population by means of the type of
individual selection discussed here.
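The two worked examples (additive and dominant gene action) can be retraced by direct enumeration of the nine two-gene genotypes. The sketch below uses exact fractions; the function names are hypothetical, and the offspring genotype frequencies are taken as products of single-gene Hardy-Weinberg proportions, which is sufficient here because each color score is a sum of per-gene contributions.

```python
from fractions import Fraction
from itertools import product


def hardy_weinberg(p):
    """Frequencies of one-gene genotypes with 2, 1, or 0 capital-letter alleles."""
    q = 1 - p
    return {2: p * p, 1: 2 * p * q, 0: q * q}


def additive(i, j):
    """Color score when every A or B allele adds one unit (Figure 9.9A)."""
    return i + j


def dominant(i, j):
    """Color score when A and B are dominant (Figure 9.9B)."""
    return min(i, 1) + min(j, 1)


def realized_h2(score, p=Fraction(1, 2), selected=(0, 1)):
    """Return (S, R, h2) after breeding only from the selected phenotypic classes."""
    hw = hardy_weinberg(p)
    genotypes = {(i, j): hw[i] * hw[j] for i, j in product(hw, repeat=2)}
    mean = sum(f * score(i, j) for (i, j), f in genotypes.items())

    parents = {g: f for g, f in genotypes.items() if score(*g) in selected}
    total = sum(parents.values())
    mean_parents = sum(f * score(*g) for g, f in parents.items()) / total
    # Allele frequencies among selected parents (i/2 = fraction of A alleles).
    p_a = sum(f * Fraction(g[0], 2) for g, f in parents.items()) / total
    p_b = sum(f * Fraction(g[1], 2) for g, f in parents.items()) / total

    hw_a, hw_b = hardy_weinberg(p_a), hardy_weinberg(p_b)
    mean_offspring = sum(
        hw_a[i] * hw_b[j] * score(i, j) for i, j in product(hw_a, hw_b)
    )
    S = mean_parents - mean       # selection differential
    R = mean_offspring - mean     # response
    return S, R, R / S


for name, score in (("additive", additive), ("dominant", dominant)):
    S, R, h2 = realized_h2(score)
    print(f"{name}: S = {S}, R = {R}, h2 = {h2} ({float(h2):.2f})")
```

The enumeration gives a realized heritability of 1 in the additive case and 17/21 (that is, 51/63) in the dominant case, even though no environmental variation has been introduced in either.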

To see how an underlying genetic model can be formulated for continuous
characters, refer to Figure 9.10, which shows the normal distribution of a trait
in a hypothetical random mating population. In truncation selection, all
individuals with phenotypes above the truncation point T are saved for breeding,
and the shaded area B of the distribution represents the proportion of the
population selected. (The total area under any normal density equals 1.) The
height of the normal density at the point T is denoted Z, and, as before, the
mean phenotype among the selected individuals is called $\mu_S$. One of the
special properties of the normal distribution to be used below is that

$$\frac{\mu_S - \mu}{\sigma^2} = \frac{Z}{B} \tag{9.11}$$

where $\sigma^2$ is the phenotypic variance of the trait.


Figure 9.10 Normal distribution of a quantitative trait in a hypothetical
population, showing some important symbols used in quantitative genetics. Here
$\mu$ is the mean of the population, T the truncation point, Z the height
(ordinate) of the normal density at the point T, B is the shaded area under the
normal curve to the right of T, and $\mu_S$ is the mean among selected parents.

To determine the amount of increase in mean phenotype in a population
resulting from one generation of truncation selection, we first imagine a gene
that affects the trait in question and that has alleles A and A' with respective
allele frequencies p and q. Because of random mating, genotypes AA, AA',
and A'A' are present in the population with frequencies $p^2$, $2pq$, and $q^2$,
respectively, but the individual genotypes cannot be identified through their
phenotypic values because of the variation in phenotype caused by environmental
factors and genetic differences in other genes. If the genotypes could
be identified, their individual distributions of phenotypic value might appear
as shown in Figure 9.11. Each distribution is normal and has the same variance,
but the means are very slightly different. The mean phenotypes of AA,
AA', and A'A' genotypes are denoted $\mu^* + a$, $\mu^* + d$, and $\mu^* - a$,
respectively. The symbols a and d serve as convenient representations of the
effects of the alleles in question on the quantitative trait. The difference
between means of homozygotes is $(\mu^* + a) - (\mu^* - a) = 2a$, and d/a serves
as a measure of dominance. The relationship d = a means that A is dominant,
d = 0 implies additivity (heterozygotes exactly intermediate in phenotype between
the homozygotes), and d = -a means that A' is dominant. (Use of a and d in this
manner simplifies some of the subsequent formulas.) Calculation of a and d
for an actual example involving two alleles that affect coat coloration in
guinea pigs is illustrated in Table 9.2. In this case, a = 0.127, d = -0.016
(the negative sign on d means that the $c^d$ allele is partially dominant), and
d/a = -0.126. Assuming Hardy-Weinberg genotype frequencies, the mean
phenotype in the entire population is

$$\mu = p^2(\mu^* + a) + 2pq(\mu^* + d) + q^2(\mu^* - a) \tag{9.12}$$

[Figure 9.11: plot not reproduced; curves labeled Distribution in whole
population, Distribution in AA, Distribution in AA', and Distribution in A'A';
horizontal axis marked at $\mu^* - a$, $\mu^* + d$, and $\mu^* + a$.]

Figure 9.11 Same distribution as in Figure 9.10, showing the slightly different
distribution of phenotypic value among the three genotypes (AA, AA', and
A'A') for a gene with two alleles that contributes to the quantitative trait. The
means of the distributions of AA, AA', and A'A' are symbolized $\mu^* + a$,
$\mu^* + d$, and $\mu^* - a$, respectively.

TABLE 9.2

Genotype                Amount of black coloration(b)

$c^r c^r$ (AA)          1.202 = $\mu^* + a$ = 1.075 + 0.127
$c^r c^d$ (AA')         1.059 = $\mu^* + d$ = 1.075 - 0.016
$c^d c^d$ (A'A')        0.948 = $\mu^* - a$ = 1.075 - 0.127

$\mu^*$ = (1.202 + 0.948)/2 = 1.075
a = 1.202 - 1.075 = 0.127
d = 1.059 - 1.075 = -0.016

Source: Data from Wright 1968.

(a) The calculations to be carried out first are those beneath the data; then the
right-hand column is completed.
(b) Here the amount of black coloration is measured as arcsin($\sqrt{x}$), where x
is the percentage of black coloration on the animal. For $c^r c^r$, $c^r c^d$, and
$c^d c^d$ genotypes, the corresponding x values are 87%, 76%, and 66%,
respectively.
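The bookkeeping in Table 9.2 and Equation 9.12 can be written out in a few lines. In this sketch the three genotype means are the tabulated scores; the function names are hypothetical, and the allele frequency p = 0.5 is chosen only for illustration.

```python
def midpoint_a_d(mean_AA, mean_het, mean_BB):
    """Return (mu_star, a, d) from the mean phenotypes of AA, AA', and A'A'."""
    mu_star = (mean_AA + mean_BB) / 2   # midpoint between the two homozygotes
    a = mean_AA - mu_star               # half the homozygote difference
    d = mean_het - mu_star              # heterozygote deviation from the midpoint
    return mu_star, a, d


def population_mean(p, mu_star, a, d):
    """Equation 9.12: mean phenotype with Hardy-Weinberg genotype frequencies."""
    q = 1 - p
    return p * p * (mu_star + a) + 2 * p * q * (mu_star + d) + q * q * (mu_star - a)


mu_star, a, d = midpoint_a_d(1.202, 1.059, 0.948)
print(f"mu* = {mu_star:.3f}, a = {a:.3f}, d = {d:.3f}, d/a = {d / a:.3f}")
print(f"population mean at p = 0.5: {population_mean(0.5, mu_star, a, d):.3f}")
```

Expanding Equation 9.12 shows that the mean can also be written as $\mu^* + (p - q)a + 2pqd$, which is a convenient check on the function above.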

PROBLEM 9.3 Crosses between the Danmark ($P_1$) and Red Currant ($P_2$) tomato
gave the following mean fruit weights and their log transforms. $P_1$ and $P_2$
are the parental means, $F_1$ and $F_2$ are the first and second hybrid
generation, and $B_1$ and $B_2$ are the progeny of the backcross of
$F_1 \times P_1$ and $F_1 \times P_2$, respectively.

Generation   Expected mean               Mean weight      Log (weight)

$P_1$        $\mu + a$                   10.36 ± 0.581     0.98 ± 0.03
$P_2$        $\mu - a$                    0.45 ± 0.017    -0.36 ± 0.02
$F_1$        $\mu + d$                    2.33 ± 0.130     0.33 ± 0.03
$F_2$        $\mu + \frac{1}{2}d$         2.12 ± 0.105     0.27 ± 0.01
$B_1$        $\mu + \frac{1}{2}(a + d)$   4.82 ± 0.253     0.64 ± 0.02
$B_2$        $\mu + \frac{1}{2}(d - a)$   0.97 ± 0.045    -0.05 ± [not legible]

Use this information to calculate $\mu$, a, and d for both the weights and
the log transformed weights. Do the simple weights or the log transformed
weights fit the model better? (Data from Powers 1951.)


ANSWER The difference between the two parental means is 2a, so a =
(10.36 - 0.45)/2 = 4.96. This gives $\mu$ = 5.4. The $F_1$ has a mean
$\mu + d$ = 2.33, so d = 2.33 - 5.4 = -3.07. The $F_2$ should have mean
(1/4)($\mu + a$) + (1/2)($\mu + d$) + (1/4)($\mu - a$) = $\mu$ + (1/2)d =
5.4 + (1/2)(-3.07) = 3.86. The $B_1$ refers to backcrosses of the $F_1$ to
$P_1$, which should yield one-half of genotypes like $P_1$ and one-half of
genotypes like the $F_1$, so the mean should be
(1/2)($\mu + a$) + (1/2)($\mu + d$) = $\mu$ + (1/2)(a + d) = 6.34. Similar
reasoning gives an expected mean for $B_2$ of 1.38. The expected means of the
$F_2$, $B_1$, and $B_2$ do not fit the observed means very well at all. Trying
again with the log transformed data, we get a = 0.67, $\mu$ = 0.31, and
d = 0.02. The expected means for the $F_2$, $B_1$, and $B_2$ are then
0.31 + (1/2)(0.02) = 0.32, 0.31 + (1/2)(0.67 + 0.02) = 0.65, and
0.31 + (1/2)(0.02 - 0.67) = -0.01. The log transformed data clearly fit much
better, suggesting that the better scale to use for the quantitative genetic
models is the log transformed scale. In actual practice, the entire set of data
is used to estimate $\mu$, a, and d by a method known as least squares, and the
goodness of fit of the model to the data can be tested by a chi-square test.
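The calculation in this answer can be sketched as follows. The function names are hypothetical, the estimates come from the two parents and the $F_1$ only (as in the answer, not the least-squares fit mentioned at the end), and no intermediate rounding is done, so the last digit may differ slightly from the rounded values in the text.

```python
def estimate_mu_a_d(p1, p2, f1):
    """Estimate (mu, a, d) from the means of P1, P2, and the F1."""
    mu = (p1 + p2) / 2
    a = (p1 - p2) / 2
    d = f1 - mu
    return mu, a, d


def expected_means(mu, a, d):
    """Expected means of the F2 and the two backcrosses under the model."""
    return {
        "F2": mu + d / 2,
        "B1": mu + (a + d) / 2,
        "B2": mu + (d - a) / 2,
    }


observed = {
    "weight": {"P1": 10.36, "P2": 0.45, "F1": 2.33, "F2": 2.12, "B1": 4.82, "B2": 0.97},
    "log weight": {"P1": 0.98, "P2": -0.36, "F1": 0.33, "F2": 0.27, "B1": 0.64, "B2": -0.05},
}

for scale, obs in observed.items():
    mu, a, d = estimate_mu_a_d(obs["P1"], obs["P2"], obs["F1"])
    print(f"{scale}: mu = {mu:.3f}, a = {a:.3f}, d = {d:.3f}")
    for generation, expected in expected_means(mu, a, d).items():
        print(f"  {generation}: expected {expected:.3f}, observed {obs[generation]:.2f}")
```

Comparing the expected and observed columns on each scale shows the much closer agreement on the log scale that the answer describes.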

Effects of the scale of measurement are known as scaling effects. For example,
the a and d values in Table 9.2 are different when calculated for the percent
of black coloration x or for the arcsin($\sqrt{x}$) tabulated values. Since
estimates of the additive and dominance values of alleles depend on scaling, 
so does the heritability. An important point is that the equivalence between 
the heritability defined by parent-offspring regression and by realized heri- 
tability depends on the correct choice of scaling. Only one scaling provides a 
normal Gaussian distribution of phenotypes and of genetic effects, and that 
is the appropriate scaling that yields the prediction Equation 9.10. 
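A small sketch of a scaling effect, using the guinea pig x values from the footnote to Table 9.2 (87%, 76%, and 66%, entered as proportions): the degree of dominance d/a is computed once on the untransformed scale and once on the arcsin-square-root scale. The function names are hypothetical.

```python
import math

# Proportion of black coloration for AA, AA', and A'A' (footnote to Table 9.2).
X_VALUES = (0.87, 0.76, 0.66)


def a_and_d(mean_AA, mean_het, mean_BB):
    """Return (a, d) from the genotype means of AA, AA', and A'A'."""
    midpoint = (mean_AA + mean_BB) / 2
    return mean_AA - midpoint, mean_het - midpoint


def degree_of_dominance(values, transform=lambda x: x):
    """Return d/a after applying the chosen scale transformation."""
    a, d = a_and_d(*(transform(v) for v in values))
    return d / a


raw = degree_of_dominance(X_VALUES)
transformed = degree_of_dominance(X_VALUES, lambda x: math.asin(math.sqrt(x)))
print(f"d/a on the proportion scale:  {raw:.3f}")
print(f"d/a on the arcsin-sqrt scale: {transformed:.3f}")
```

The same three genotypes look closer to additive on one scale than on the other, which is why estimates of a, d, and hence heritability cannot be separated from the choice of scale.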

Change in Gene Frequency 

Suppose for the moment that we were practicing artificial selection for
increased amount of black coat coloration in the guinea pigs in Table 9.2.
Selection for black coat coloration in a population containing both the $c^r$
(i.e., A) and $c^d$ (i.e., A') alleles would be successful in increasing the
allele frequency of A, and the average amount of black coloration among
individuals of the next generation would increase. Therefore, in order to
calculate the expected increase in black coloration in one generation of
selection, we must first calculate the corresponding change in the allele
frequency of A. An equation for change in allele frequency with natural
selection was derived in Chapter 6, which remains valid for artificial selection
if we agree to interpret the "fitness" of an individual as the probability
that the individual is included among the group selected as parents of
the next generation. With this interpretation of fitness, differences in
fitness (i.e., reproductive success) of AA, AA', and A'A' genotypes correspond
to the differences in area to the right of the truncation point in
Figure 9.11, because only those individuals in the shaded area are allowed
to reproduce. The differences in area are easy to calculate if you shift or
slide each curve horizontally until its mean coincides with $\mu^*$. The A'A'
curve must slide a units to the right, and the AA' and AA curves must slide
d and a units to the left. This shifting brings the distributions into
coincidence, but it slides the truncation points slightly out of register, as
shown in Figure 9.12. The difference in "fitness" between AA and AA', denoted
$w_{11} - w_{12}$ (as in Chapter 6), is equal to the small area indicated in
Figure 9.12, as is the difference in fitness between AA' and A'A', denoted
$w_{12} - w_{22}$. The areas corresponding to $w_{11} - w_{12}$ and
$w_{12} - w_{22}$ are approximately rectangles, and the area of a rectangle is
the product of the base and the height. The approximation is most accurate when
the effect of this one locus on the phenotype is small. Therefore, since Z
represents the height of the normal distribution at the point T, we can make the
following approximations:

$$\begin{aligned}
w_{11} - w_{12} &\approx Z[(T - d) - (T - a)] = Z(a - d) \\
w_{12} - w_{22} &\approx Z[(T + a) - (T - d)] = Z(a + d)
\end{aligned} \tag{9.13}$$


The average fitness $\bar{w}$ of the entire population simply equals B, because
B is the proportion of the population saved for breeding. From Chapter 6 we
know that

$$\Delta p = pq[p(w_{11} - w_{12}) + q(w_{12} - w_{22})]/\bar{w}$$

where $\Delta p$ is the change in frequency of the allele A in one generation of
selection. Substituting from Equation 9.13 and using $\bar{w} = B$ leads to

$$\Delta p = pq[pZ(a - d) + qZ(a + d)]/B$$

or, since p + q = 1,

$$\Delta p = (Z/B)pq[a + (q - p)d] \tag{9.15}$$
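The quality of the rectangle approximation behind Equation 9.15 can be examined numerically. The sketch below is an illustration under stated assumptions: each genotype has a normal distribution with the same standard deviation, the "fitnesses" are the exact areas above the truncation point, and Z and B are evaluated for the whole population treated as the mixture of the three genotype distributions. All parameter values and names are hypothetical.

```python
from statistics import NormalDist


def delta_p_exact_and_approx(p, a, d, T, mu_star=0.0, sd_within=1.0):
    """Return (exact, approximate) change in the frequency of allele A."""
    q = 1 - p
    freqs = (p * p, 2 * p * q, q * q)                  # AA, AA', A'A'
    means = (mu_star + a, mu_star + d, mu_star - a)
    dists = [NormalDist(m, sd_within) for m in means]

    w = [1.0 - dist.cdf(T) for dist in dists]          # proportion of each genotype saved
    B = sum(f * wi for f, wi in zip(freqs, w))         # overall proportion saved
    Z = sum(f * dist.pdf(T) for f, dist in zip(freqs, dists))  # population density at T

    exact = p * q * (p * (w[0] - w[1]) + q * (w[1] - w[2])) / B
    approx = (Z / B) * p * q * (a + (q - p) * d)       # Equation 9.15
    return exact, approx


exact, approx = delta_p_exact_and_approx(p=0.3, a=0.1, d=0.02, T=1.0)
print(f"exact delta p  = {exact:.5f}")
print(f"approx delta p = {approx:.5f}")
```

Making a and d smaller relative to the within-genotype standard deviation should bring the two values closer together, in line with the remark that the approximation is best when the gene has a small effect.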



An equation corresponding to 9.15 could be obtained for any gene affect- 
ing the trait, but the values of p, a, and d would differ for each gene. The 
quantity in square brackets in Equation 9.15 is called the average excess. A 
generalization that accounts for nonrandom mating is found in Falconer 


[Figure 9.12: plot not reproduced; curves labeled as the distributions in AA,
AA', and A'A' shifted to coincide and the distribution in the entire population;
horizontal axis marked at T - a, T - d, and T + a.]

Figure 9.12 Same distribution as in Figures 9.10 and 9.11, but with the
distributions of AA, AA', and A'A' shifted laterally to coincide. Shifting the
distributions slides the truncation points slightly out of register, so the
truncation points for AA, AA', and A'A' become T - a, T - d, and T + a,
respectively. The small area that is denoted $w_{11} - w_{12}$ is the difference
between the proportions of AA and AA' genotypes that are included among the
selected parents, and the area $w_{12} - w_{22}$ is the difference in the
proportion of AA' and A'A' genotypes included among the selected parents.

Genetic Model for the Change In Mean Phenotype 

Equation 9.15 provides an expression for $\Delta p$ which can be used to
calculate the mean phenotypic value of coat color after one generation of
selection. In the next generation, the allele frequencies of A and A' are
$p + \Delta p$ and $q - \Delta p$, respectively. With random mating, the mean
phenotype in this generation is given by Equation 9.12 as

$$\mu' = (p + \Delta p)^2(\mu^* + a) + 2(p + \Delta p)(q - \Delta p)(\mu^* + d) + (q - \Delta p)^2(\mu^* - a)$$

When the right-hand side of this expression is multiplied out and terms in
$(\Delta p)^2$ are ignored because $\Delta p$ is usually small, then $\mu'$ is
found to be approximately

$$\mu' \approx \mu + 2[a + (q - p)d]\Delta p \tag{9.17}$$

The approximation in Equation 9.17 is rather good even for relatively large
values of $\Delta p$.
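The size of the neglected term can be made explicit. Multiplying out the expression for the new mean gives exactly $\mu' = \mu + 2[a + (q - p)d]\Delta p - 2d(\Delta p)^2$, so the error of Equation 9.17 is $-2d(\Delta p)^2$. The sketch below checks this with exact fractions; the numerical values and function names are hypothetical.

```python
from fractions import Fraction


def mean_phenotype(p, mu_star, a, d):
    """Equation 9.12 evaluated at allele frequency p."""
    q = 1 - p
    return p * p * (mu_star + a) + 2 * p * q * (mu_star + d) + q * q * (mu_star - a)


def exact_and_approximate_mean(p, delta_p, mu_star, a, d):
    """Return (exact new mean, Equation 9.17 approximation, exact - approximation)."""
    q = 1 - p
    mu = mean_phenotype(p, mu_star, a, d)
    exact = mean_phenotype(p + delta_p, mu_star, a, d)
    approx = mu + 2 * (a + (q - p) * d) * delta_p
    return exact, approx, exact - approx


p, delta_p = Fraction(3, 10), Fraction(1, 20)
mu_star, a, d = Fraction(1), Fraction(1, 10), Fraction(1, 50)
exact, approx, error = exact_and_approximate_mean(p, delta_p, mu_star, a, d)
print(f"exact = {exact}, approx = {approx}, error = {error}")
print("error equals -2 d (delta p)^2:", error == -2 * d * delta_p ** 2)
```

Because the error is proportional to d, the approximation is exact for an additive gene (d = 0) whatever the size of $\Delta p$.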


Equation 9.17 warrants a little more development since it yields the
prediction equation $R = h^2 S$ (Equation 9.10) and also provides an expression
for $h^2$ in terms of the parameters a, d, and p that can be interpreted
genetically. First, rewrite Equation 9.17 as

$$\mu' - \mu = 2[a + (q - p)d]\Delta p$$

Then substitute for $\Delta p$ from Equation 9.15, which yields

$$\mu' - \mu = (Z/B)\,2pq[a + (q - p)d]^2$$

Now use the expression for Z/B given in Equation 9.11 to obtain

$$\mu' - \mu = (\mu_S - \mu)\,2pq[a + (q - p)d]^2/\sigma^2$$

Finally, substitute from Equations 9.8 and 9.9 for the selection differential S
and the response R, yielding

$$R = (S)\,2pq[a + (q - p)d]^2/\sigma^2 \tag{9.21}$$

However, $R = h^2 S$ also (Equation 9.10), and so

$$h^2 = 2pq[a + (q - p)d]^2/\sigma^2 \tag{9.22}$$

Equation 9.22 for $h^2$ is the one we were after, as it defines the heritability
in terms of p, q, a, and d, each of which has a genetic meaning.
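As a rough check on the chain of approximations leading to Equation 9.22, one generation of truncation selection on a single gene can be computed exactly and compared with the prediction $h^2 S$. This is a sketch under stated assumptions: normal distributions with equal standard deviation within each genotype, random mating before and after selection, and hypothetical parameter values. The function and variable names are illustrative only.

```python
from statistics import NormalDist


def response_and_prediction(p, a, d, T, mu_star=0.0, sd_within=1.0):
    """Return (R, h2 * S): the exact one-generation response and its prediction."""
    q = 1 - p
    freqs = (p * p, 2 * p * q, q * q)                  # AA, AA', A'A'
    means = (mu_star + a, mu_star + d, mu_star - a)
    mu = sum(f * m for f, m in zip(freqs, means))
    # Total phenotypic variance = within-genotype variance + variance among genotype means.
    var_total = sd_within ** 2 + sum(f * (m - mu) ** 2 for f, m in zip(freqs, means))

    std = NormalDist()
    saved, saved_sum = [], []
    for m in means:
        t = (T - m) / sd_within
        w = 1.0 - std.cdf(t)                           # proportion of this genotype saved
        saved.append(w)
        saved_sum.append(m * w + sd_within * std.pdf(t))   # integral of x above T
    B = sum(f * w for f, w in zip(freqs, saved))
    mu_s = sum(f * s for f, s in zip(freqs, saved_sum)) / B

    p_next = (freqs[0] * saved[0] + 0.5 * freqs[1] * saved[1]) / B
    q_next = 1 - p_next
    mu_next = (p_next ** 2 * (mu_star + a) + 2 * p_next * q_next * (mu_star + d)
               + q_next ** 2 * (mu_star - a))

    S = mu_s - mu
    R = mu_next - mu
    h2 = 2 * p * q * (a + (q - p) * d) ** 2 / var_total   # Equation 9.22
    return R, h2 * S


R, predicted = response_and_prediction(p=0.3, a=0.1, d=0.02, T=1.0)
print(f"exact response R    = {R:.6f}")
print(f"predicted h^2 * S   = {predicted:.6f}")
```

With a gene of small effect, as here, the exact response and the prediction should differ by only a small fraction of their value.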

Equation 9.22 is a valid approximation when a single gene affects the trait 
in question, and when the effects of that gene are small. However, when 
many genes affect the trait, the right-hand side of the equation must be 
replaced by a summation of such terms, one for each gene. That is, for many 
genes, $R = h^2 S$ where

$$h^2 = \sum 2pq[a + (q - p)d]^2/\sigma^2 \tag{9.23}$$

in which the summation is over all genes that affect the trait. (However, each
gene may have different values of a, d, p, and q.) As will be discussed in more
detail later, the quantity

$$\sigma_a^2 = \sum 2pq[a + (q - p)d]^2 \tag{9.24}$$

is called the additive genetic variance of the trait. Although the individual 
components in the additive genetic variance are difficult to identify except in 
contrived examples like the one involving guinea pigs, the collective effects 
(represented by the summation) can be estimated. 


As Equation 9.24 suggests, the variance of a quantitative trait can be split 
into various components representing different causes of variation. 
Similarity between relatives is conveniently expressed in terms of the
variance components, but variance partitioning is also of interest in its own right.
Since the rate of change of a trait under selection depends on the amount of 
genetic variation affecting the trait, if there is no genetic variation, there is 
obviously no response to selection. What is not so obvious is that some com- 
ponents of genetic variation cannot be acted upon by some kinds of selec- 
tion. In other words, certain populations have ample genetic variation, yet 
fail to respond to selection. The part of the genetic variation amenable to 
selection is clarified by partitioning the variance. 

Genetic and Environmental Sources of Variation 

As shown in Table 9.3, the phenotypic value of any individual can be
represented as a sum of three components: (1) the mean $\mu$ of the entire
population, (2) a deviation from the population mean due to the specific
genotype of the individual in question (symbolized as $G_1$, $G_2$, and $G_3$ for
AA, AA', and A'A' genotypes, respectively), and (3) a deviation from the
population mean due to the specific microenvironment of the individual in
question. (The environmental deviations are unique to each individual and are
represented as $E_1, E_2, \ldots, E_9$.) These microenvironmental effects might be
due to random differences in nutrition, temperature, or other external factors,
or they might be seen even in an absolutely uniform external environment due to
the vagaries of embryonic development. It is important to note that the Gs and
Es are not directly observable. Nevertheless, as we shall see, the total
variance in phenotypic value can be partitioned into a component due to
variation among the Gs and another component due to variation among the Es.
The model can be summarized by writing

$$P = \mu + G + E \tag{9.25}$$

where P represents the phenotypic value of any individual and G and E are
the genotypic and environmental deviations pertaining to that individual.

TABLE 9.3

Genotype     Phenotypic value(a)

AA           $\mu + G_1 + E_1$
AA           $\mu + G_1 + E_2$
AA           $\mu + G_1 + E_3$
AA'          $\mu + G_2 + E_4$
AA'          $\mu + G_2 + E_5$
AA'          $\mu + G_2 + E_6$
A'A'         $\mu + G_3 + E_7$
A'A'         $\mu + G_3 + E_8$
A'A'         $\mu + G_3 + E_9$

(a) $\mu$ is the population mean. G is a contribution due to genotype,
different for each genotype. E is a contribution due to environment,
different for each individual.

To connect the above symbols with actual numbers, we may use Table 9.2
and assume an allele frequency of A of p = 0.2. Equation 9.12 then implies
that the mean of the population is $\mu$ = 0.994. Thus, the respective $G_1$,
$G_2$, and $G_3$ deviations for AA, AA', and A'A' genotypes are

$$\begin{aligned}
G_1 &= 1.202 - 0.994 = 0.208 \\
G_2 &= 1.059 - 0.994 = 0.065 \\
G_3 &= 0.948 - 0.994 = -0.046
\end{aligned}$$

For a particular animal of genotype AA whose actual coat color score is, for
example, 1.312, the corresponding value of E for the animal would be calculated
using Equation 9.25 from the expression 1.312 = 0.994 + 0.208 + E; thus,
for this animal, E = 0.11. Similarly, a particular animal of genotype AA' with
an actual phenotype of P = 1.009 would have a value of E given by 1.009 =
0.994 + 0.065 + E, or E = -0.05. Because the E values are defined as deviations
from their mean, the average of Es for any genotype is 0. Likewise, since the
Gs are defined as deviations from their mean, the mean of the Gs is 0. This
result can be verified in the guinea pig example because, to within rounding
error,

$$(0.2)^2 G_1 + 2(0.2)(0.8) G_2 + (0.8)^2 G_3 = 0$$

Equation 9.25 is appropriate when the effects of genotype and environment
are additive; that is, when the deviation of the phenotype of any particular
individual from the population mean $(P - \mu)$ can be written as the sum of an
effect resulting from the genotype of that individual and a separate effect
resulting from the environment of that individual.
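The numbers in this example can be reproduced with a short sketch. The genotype means are those of Table 9.2 and p = 0.2 as in the text; the function names are hypothetical, and unrounded values are carried through, so the weighted mean of the Gs comes out as zero up to floating-point error rather than up to three-decimal rounding.

```python
def genotypic_deviations(p, genotype_means):
    """Return (mu, [G1, G2, G3], Hardy-Weinberg frequencies) for AA, AA', A'A'."""
    q = 1 - p
    freqs = (p * p, 2 * p * q, q * q)
    mu = sum(f * m for f, m in zip(freqs, genotype_means))
    return mu, [m - mu for m in genotype_means], freqs


def environmental_deviation(phenotype, mu, G):
    """Solve P = mu + G + E (Equation 9.25) for E."""
    return phenotype - mu - G


mu, G, freqs = genotypic_deviations(0.2, (1.202, 1.059, 0.948))
print(f"mu = {mu:.3f}")
print("G1, G2, G3 =", ", ".join(f"{g:.3f}" for g in G))
print(f"frequency-weighted mean of the Gs = {sum(f * g for f, g in zip(freqs, G)):.6f}")
print(f"E for an AA animal scoring 1.312  = {environmental_deviation(1.312, mu, G[0]):.2f}")
print(f"E for an AA' animal scoring 1.009 = {environmental_deviation(1.009, mu, G[1]):.2f}")
```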

PROBLEM 9.4 In Problem 9.3 the values of $\mu$, a, and d were found to
be 0.31, 0.67, and 0.02, respectively, for the logarithms of tomato
weight. Calculate the additive genetic variance in the $F_2$ population,
the $B_1$ population, and the $B_2$ population.

ANSWER In the $F_2$ population the allele frequency is p = q = 1/2, so
the formula for the additive genetic variance (Equation 9.24) is
$\sigma_a^2 = 2pqa^2 = \frac{1}{2}a^2 = 0.224$. In the backcross 1 population
$B_1$, the allele frequencies are p = 3/4 and q = 1/4, so, applying Equation
9.24, we get $\sigma_a^2 = 2(3/4)(1/4)[0.67 + (1/4 - 3/4)(0.02)]^2 = 0.163$. The
backcross 2 population $B_2$ has allele frequencies p = 1/4 and q = 3/4, so
$\sigma_a^2 = 0.173$. When the dominance parameter is so small, the additive
variance is very nearly at a maximum when the allele frequencies are both 1/2,
and the graph of additive variance against allele frequency is nearly
symmetric.
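The single-gene form of Equation 9.24 is easy to evaluate directly. In the sketch below, p is taken to be the frequency of the allele contributed by $P_1$ (the parent with mean $\mu + a$), so that p = 3/4 in $B_1$ and p = 1/4 in $B_2$; the function name is hypothetical.

```python
def additive_variance(p, a, d):
    """Single-gene additive genetic variance, 2pq[a + (q - p)d]^2 (Equation 9.24)."""
    q = 1 - p
    return 2 * p * q * (a + (q - p) * d) ** 2


a, d = 0.67, 0.02
for population, p in (("F2", 0.5), ("B1", 0.75), ("B2", 0.25)):
    print(f"{population}: p = {p}, additive variance = {additive_variance(p, a, d):.3f}")
```

Because d is positive here, the bracketed term is slightly larger when the partially dominant allele is the rarer one, so the two backcross values are close but not identical.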

To this point the discussion has been restricted to a particular population
in a single macroenvironment, and the sources of variation have been due to
genetic and microenvironmental differences among individuals. A change
in macroenvironment is easiest to see in an experimental setting, where, for
example, in macroenvironment 1 all of the guinea pigs get twice as much
food as in macroenvironment 2. Additivity of genetic and environmental
effects is true whenever the ratio of $G_1 : G_2 : G_3$ is the same in each of
the relevant environments. For the genotypes in Figure 9.13, for example, if the
actual range of environments is the range designated $E_1$, then the genetic and
environmental effects are additive because the ratio $G_1 : G_2 : G_3$ is the
same for any particular environment in $E_1$. For the same reason, the genetic
and environmental effects are additive if the actual range of environments is
$E_2$.

[Figure 9.13: plot not reproduced; horizontal axis: Environment.]

Figure 9.13 The norm of reaction is the relation between the phenotype and the
environment, and this relation is known to vary from genotype to genotype.
Hypothetical norms of reaction for genotypes AA, AA', and A'A' are shown here.
In the range of environments denoted $E_1$, A is very nearly dominant to A' (that
is, AA and AA' have nearly the same phenotype). However, in the range $E_2$, A
and A' are very nearly additive (no dominance). The heritability of the trait
resulting from this gene differs according to whether the population is reared in
$E_1$ environments or $E_2$ environments.

However, if the actual range of environments includes both $E_1$ and $E_2$,
then the ratio $G_1 : G_2 : G_3$ depends on the particular environment, and
therefore the genetic and environmental effects may not be additive.
Nonadditivity of genetic and environmental effects is called
genotype-environment interaction, and writing Equation 9.25 appears to assume
that there is no genotype-environment interaction. In actually estimating
components of variance, it is not necessary to assume that there is no
genotype-environment interaction, because we can explicitly examine the
phenotypes when reared in different macroenvironments and directly estimate the
magnitude of the interaction. Alternatively, we can arbitrarily define the
environmental variance as also including the effects of genotype-environment
interaction.

When Equation 9.25 is valid, the total phenotypic variance $\sigma_p^2$ in the
population equals the mean of $(P - \mu)^2$. However, Equation 9.25 implies that
$(P - \mu)^2$ equals $(\mu + G + E - \mu)^2 = (G + E)^2$, so that, with overbars
denoting means over the population,

$$\sigma_p^2 = \overline{(G + E)^2} = \overline{G^2} + \overline{2GE} + \overline{E^2}$$

Because G and E are already deviations from their means, the mean of $G^2$
is the phenotypic variance in the population resulting from differences in
genotype, and the mean of $E^2$ is the phenotypic variance resulting from
differences in environment. The mean of $G^2$ is called the genotypic variance
and is denoted $\sigma_g^2$. The mean of $E^2$ is called the environmental
variance and is denoted $\sigma_e^2$. The remaining term, the mean of 2GE, is two
times the genotype-environment covariance. If the genotypic and environmental
deviations are uncorrelated (that is, if there is no systematic association
between genotype and environment), then there is said to be no
genotype-environment association and the mean of 2GE equals zero. When
there is no genotype-environment association, therefore,

$$\sigma_p^2 = \sigma_g^2 + \sigma_e^2 \tag{9.27}$$


Equation 9.27 is the theoretical foundation for partitioning the variance 
into genetic and environmental effects. The assumption that genotype-envi- 
ronment association is negligible is frequently a valid assumption in animal 
and plant breeding where, because breeders have a degree of control not 
available to, for example, human geneticists, experiments can be intentional- 
ly designed in such a way as to minimize genotype-environment association. 
However, genotype-environment association can occur even in animal and 
plant breeding. For example, dairy farmers routinely provide more feed sup- 
plements to cows that produce more milk; because milk-producing ability is 
partly due to genotype, this feed regimen will provide superior environments
(better feed) to cows that have superior genotypes to begin with, so
there will be a genotype-environment association. Similarly, the best race
horses get the best trainers and the children of the best students often go to
the best schools. If one is not careful to correct for such associations,
genotype-environment association can inflate the apparent $\sigma_g^2$ and
possibly give spurious overestimates of heritability.
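
The role of the covariance term can be checked numerically. The following is a minimal sketch, assuming simulated normal deviations and a hypothetical "better feed for better genotypes" rule in which E is made to track G; the function name, the sample size, and the coefficient 0.5 are illustrative choices, not values from the text.

```python
import random
import statistics


def variance_partition(g, e):
    """Return (var_P, var_G, var_E, cov_GE) for paired deviations G and E."""
    p = [gi + ei for gi, ei in zip(g, e)]
    mean_g, mean_e = statistics.fmean(g), statistics.fmean(e)
    cov_ge = sum((gi - mean_g) * (ei - mean_e) for gi, ei in zip(g, e)) / len(g)
    return (statistics.pvariance(p), statistics.pvariance(g),
            statistics.pvariance(e), cov_ge)


rng = random.Random(1)
g = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# No genotype-environment association: E is drawn independently of G.
e_indep = [rng.gauss(0.0, 1.0) for _ in g]

# Hypothetical association: superior genotypes receive superior environments.
e_assoc = [0.5 * gi + rng.gauss(0.0, 1.0) for gi in g]

for label, e in (("independent", e_indep), ("associated", e_assoc)):
    var_p, var_g, var_e, cov_ge = variance_partition(g, e)
    # var_p always equals var_g + var_e + 2*cov_ge; only the last term differs.
    print(label, round(var_p, 3), round(var_g + var_e, 3), round(2 * cov_ge, 3))
```

With independent environments the covariance term is close to zero and Equation 9.27 holds to a good approximation; with the association, the phenotypic variance exceeds the sum of the two components by twice the covariance.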

The biological meaning of Equation 9.27 is shown for the alleles of one
gene in Figure 9.14. The solid curves represent the phenotypic distributions
in the genotypes AA, AA', and A'A' with means denoted $G_1$, $G_2$, and $G_3$,
and the dashed curve represents the phenotypic distribution in the entire
population. The total phenotypic variance $\sigma_p^2$ is the variance of the
dashed distribution; the genotypic variance $\sigma_g^2$ is the variance among
the Gs (i.e., $\sigma_g^2 = p^2G_1^2 + 2pqG_2^2 + q^2G_3^2$, where p is the allele
frequency of A); and the environmental variance $\sigma_e^2$ is obtained by
subtraction: $\sigma_e^2 = \sigma_p^2 - \sigma_g^2$. Although the Gs are not
generally known, $\sigma_g^2$ must equal zero in a genetically uniform population.
The observed variance of a randomly bred population, therefore, provides an
estimate of $\sigma_g^2 + \sigma_e^2$, whereas the average observed variance of
genetically uniform populations provides an estimate of $\sigma_e^2$. The
estimate of $\sigma_g^2$ is obtained by subtraction, as shown in an example using
thorax length in Drosophila (Table 9.4). In this case, genetic variation among
individuals in the randomly bred population accounts for about
0.180/0.366 = 49.2% of the phenotypic variance. Genetically uniform
populations such as inbred lines or crosses

Figure 9.14 Phenotypic distribution (dashed curve) of a quantitative trait in a
hypothetical population, showing distributions (solid curves) of three
constituent genotypes for two alleles of a gene. The means of AA, AA', and A'A'
genotypes are denoted $G_1$, $G_2$, and $G_3$, respectively.









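TABLE 9.4

Randomly bred population          $\sigma_g^2 + \sigma_e^2 = 0.366$
Genetically uniform populations   $\sigma_e^2 = 0.186$

$\sigma_g^2 = (\sigma_g^2 + \sigma_e^2) - \sigma_e^2 = 0.366 - 0.186 = 0.180$

$\sigma_g^2/(\sigma_g^2 + \sigma_e^2) = 0.180/0.366 = 0.492$

Source: Data from Robertson 1957.

a Trait is length of thorax in Drosophila melanogaster (in units of $10^{-2}$ mm).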

between inbreds are not available in human populations, but identical twins 
are often used instead because of the identical genotypes of the twins. 
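
The subtraction used for Table 9.4 can be written as a small helper. This is only a sketch of the arithmetic in the text; the function name is illustrative, and the two inputs are the variances quoted above for the randomly bred and genetically uniform populations.

```python
def genotypic_variance_by_subtraction(var_random_bred, var_uniform):
    """Estimate the genotypic variance and its share of phenotypic variance.

    var_random_bred estimates sigma_g^2 + sigma_e^2;
    var_uniform estimates sigma_e^2 alone.
    """
    var_g = var_random_bred - var_uniform
    return var_g, var_g / var_random_bred


var_g, fraction = genotypic_variance_by_subtraction(0.366, 0.186)
print(round(var_g, 3), round(fraction, 3))
```

The same helper applies to any trait for which both kinds of population can be measured under comparable conditions.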

An example of a naturally occurring organism that exhibits remarkably 
low levels of genetic variability is the African cheetah (O'Brien et al. 1983, 
1987; May 1995). One might suppose that limited genetic variation would 
result in depressed phenotypic variability as well, but a study of cranial mea- 
sures by Wayne et al. (1986) revealed that the amount of variability was not 
appreciably less than that in three other large cats. In fact, there was a signif- 
icant increase in the amount of fluctuating asymmetry (that is, the difference 
in measurements from the left and the right side of the body). The fluctuating 
asymmetry is consistent with the notion that genetic homozygosity results in 
reduced developmental stability — an idea that has considerable empirical 
support, but so far no good explanation in molecular terms. In any event, 
reduction of genetic variance, and concomitant high homozygosity, can result 
in an increase in phenotypic variance due to developmental instability. Since 
extreme homozygosity may result in phenotypes that are very sensitive to 
environmental fluctuations, the paradoxical increase in phenotypic variance 
results from genotype-environment interaction. 

Components of Genotypic Variation

So far, the phenotypic variance has been partitioned into the genotypic
variance and the environmental variance according to Equation 9.27.
The genotypic variance can be partitioned further into terms that are
particularly important for interpreting the resemblance between relatives.
The appropriate model is shown in Table 9.5, where the phenotypic means
of AA, AA', and A'A' are denoted by $\mu^* + a$, $\mu^* + d$, and $\mu^* - a$, as
they were earlier in Figure 9.11. To obtain the G values, the mean of each
genotype must be expressed as a deviation from the population mean, which is






TABLE 9.5

Genotype   Frequency   Mean phenotype   Genotypic deviation from population mean (G)

AA         $p^2$       $\mu^* + a$      $G_1 = \mu^* + a - \mu = 2q[a + (q - p)d] - 2q^2d$
AA'        $2pq$       $\mu^* + d$      $G_2 = \mu^* + d - \mu = (q - p)[a + (q - p)d] + 2pqd$
A'A'       $q^2$       $\mu^* - a$      $G_3 = \mu^* - a - \mu = -2p[a + (q - p)d] - 2p^2d$

Population mean
$\mu = p^2(\mu^* + a) + 2pq(\mu^* + d) + q^2(\mu^* - a)$
$\;\; = (p^2 + 2pq + q^2)\mu^* + (p^2 - q^2)a + 2pqd$
$\;\; = (p + q)^2\mu^* + (p - q)(p + q)a + 2pqd$
$\;\; = \mu^* + (p - q)a + 2pqd$

$\mu = \mu^* + (p - q)a + 2pqd$, and the deviations are shown in the last column
of Table 9.5. The genotypic variance $\sigma_g^2$ is calculated as

$$\sigma_g^2 = p^2G_1^2 + 2pqG_2^2 + q^2G_3^2 = 2pq[a + (q - p)d]^2 + (2pqd)^2 \tag{9.28}$$

The first term in Equation 9.28 is the additive genetic variance $\sigma_a^2$
encountered earlier in Equation 9.24. The second term is a new quantity called
the dominance variance, which is symbolized $\sigma_d^2$. From Equation 9.28,
therefore,

$$\sigma_g^2 = \sigma_a^2 + \sigma_d^2 \tag{9.29}$$

which allows us to express the total phenotypic variance as the sum of three
terms, namely

$$\sigma_p^2 = \sigma_a^2 + \sigma_d^2 + \sigma_e^2 \tag{9.30}$$
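
The partition of the genotypic variance into additive and dominance terms can be verified numerically for any allele frequency. The sketch below assumes Hardy-Weinberg proportions and the one-gene model of Table 9.5; the function names are illustrative. It computes the genotypic deviations directly from the genotype means and compares the resulting variance with the sum of $2pq[a + (q - p)d]^2$ and $(2pqd)^2$.

```python
def genotypic_deviations(p, a, d):
    """Deviations (G1, G2, G3) of AA, AA', A'A' from the population mean."""
    q = 1.0 - p
    shift = (p - q) * a + 2 * p * q * d  # population mean minus mu*
    return a - shift, d - shift, -a - shift


def genotypic_variance(p, a, d):
    """p^2 G1^2 + 2pq G2^2 + q^2 G3^2 under Hardy-Weinberg proportions."""
    q = 1.0 - p
    g1, g2, g3 = genotypic_deviations(p, a, d)
    return p * p * g1 ** 2 + 2 * p * q * g2 ** 2 + q * q * g3 ** 2


def variance_components(p, a, d):
    """Additive and dominance variance for one gene with two alleles."""
    q = 1.0 - p
    var_a = 2 * p * q * (a + (q - p) * d) ** 2
    var_d = (2 * p * q * d) ** 2
    return var_a, var_d


for p in (0.1, 0.5, 0.8):
    var_a, var_d = variance_components(p, a=0.1, d=0.05)
    total = genotypic_variance(p, a=0.1, d=0.05)
    print(p, round(total, 6), round(var_a + var_d, 6))
```

The two printed columns agree at every allele frequency, which is the content of the partition of the genotypic variance into additive and dominance terms.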


When Equation 9.22 for heritability is written in terms of variance
components rather than p, q, a, and d, the equation implies that

$$h^2 = \sigma_a^2/\sigma_p^2 \tag{9.31}$$


Equation 9.31 is an important result because it states that the heritability
depends only on the additive genetic variance and not on the dominance
variance. Therefore, if all the genetic variance in a population results from
dominance variance (i.e., $\sigma_a^2 = 0$), then the population cannot respond to
individual selection because $h^2$ equals zero. To say the same thing in another
way, the dominance variance $\sigma_d^2$ represents that portion of the genetic
variance that is not acted upon by individual selection.

Equation 9.31 means that the heritability of a trait is the ratio of the addi- 
tive genetic variance to the total phenotypic variance. Sometimes the word 


heritability is used in reference to a different variance ratio, namely the ratio
of the total genotypic variance to the total phenotypic variance (i.e.,
$\sigma_g^2/\sigma_p^2$). To avoid confusion, quantitative geneticists distinguish
the two types of heritability as follows:

1. The ratio $\sigma_a^2/\sigma_p^2$ is called heritability in the narrow sense.
(This is the variance ratio we have been using all along.)

2. The ratio $\sigma_g^2/\sigma_p^2$ is called heritability in the broad sense.

Generally speaking, narrow-sense heritability is the more important with 
individual selection (or any mode of selection that capitalizes primarily on 
the additive genetic variance), whereas broad-sense heritability is the more 
important when selection is practiced among clones (a clone is a group of 
genetically identical individuals), inbred lines, or varieties. We use the term 
heritability to mean narrow-sense heritability unless otherwise stated. 
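
The difference between the two ratios is easiest to see with a case in which they diverge completely. The following sketch assumes one gene in Hardy-Weinberg proportions and a hypothetical environmental variance of 0.005; the function name and that value are illustrative. With pure overdominance (a = 0) at p = 0.5, all of the genetic variance is dominance variance, so the narrow-sense heritability is zero even though the broad-sense heritability is not.

```python
def heritabilities(p, a, d, var_e):
    """Return (narrow-sense, broad-sense) heritability for one gene."""
    q = 1.0 - p
    var_a = 2 * p * q * (a + (q - p) * d) ** 2  # additive variance
    var_d = (2 * p * q * d) ** 2                # dominance variance
    var_p = var_a + var_d + var_e               # total phenotypic variance
    return var_a / var_p, (var_a + var_d) / var_p


# Overdominance at intermediate allele frequency (var_e is hypothetical).
narrow, broad = heritabilities(p=0.5, a=0.0, d=0.141, var_e=0.005)
print(narrow, round(broad, 3))

# No dominance: the two heritabilities coincide.
narrow, broad = heritabilities(p=0.5, a=0.1, d=0.0, var_e=0.005)
print(round(narrow, 3), round(broad, 3))
```

In the first case individual selection would produce no response, whereas selection among clones or inbred lines could still exploit the genotypic differences.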

As emphasized earlier, heritability has no transparent interpretation in
simple genetic terms. The same is true of the variance components $\sigma_a^2$ and
$\sigma_d^2$. Even for a single gene, the variance components depend on the
particular values of a and d (Figure 9.15), and of course the estimates of
heritability must also depend on allele frequency (Figure 9.16). With many
genes that act together, $\sigma_a^2$ is defined as a summation of the values of
$2pq[a + (q - p)d]^2$ for each gene affecting the trait, and $\sigma_d^2$ represents
a summation of the values of $(2pqd)^2$ for each gene. Furthermore, when the
trait is affected by multiple genes, the formula for $\sigma_g^2$ in Equation 9.29
must be extended to include an additional term that pertains to interaction
among the genes. This interaction term is called the interaction variance or
the epistatic variance and is symbolized $\sigma_i^2$. With the interaction
variance included, Equation 9.29 becomes

$$\sigma_g^2 = \sigma_a^2 + \sigma_d^2 + \sigma_i^2 \tag{9.32}$$


The important point to remember about the components of genotypic 
variance is that they represent the cumulative, statistical effects of all genes 
affecting the trait. Few inferences about the actual mode of inheritance of the 
trait are possible from the variance components, particularly concerning the 
number of genes involved and their individual effects. 

PROBLEM 9.5 By definition, a simple Mendelian trait is one that is
determined entirely by genotype in the prevailing environment.
Therefore, $\sigma_e^2 = 0$ in Equation 9.27, and the broad-sense heritability
$\sigma_g^2/\sigma_p^2 = 1$. Show that, for a simple Mendelian recessive, the
narrow-sense heritability equals $2q/(1 + q)$, where q is the recessive
allele frequency.


Figure 9.15 Total genetic variance ($\sigma_g^2$), additive genetic variance
($\sigma_a^2$), and dominance variance ($\sigma_d^2$) for a locus with two alleles
(A and A') plotted against the frequency of allele A (p). The mean phenotypes
of AA, AA', and A'A' are denoted $\mu^* + a$, $\mu^* + d$, and $\mu^* - a$,
respectively. In all cases, $\sigma_a^2 = 2pq[a + (q - p)d]^2$,
$\sigma_d^2 = (2pqd)^2$, and $\sigma_g^2 = \sigma_a^2 + \sigma_d^2$.
(A) a = d = 0.0707 (A dominant to A'); (B) a = 0.1, d = 0 (no dominance);
(C) d = -a = 0.0707 (A' dominant to A); (D) a = 0, d = 0.141 (overdominance).
For ease of comparison, the values of a and d have been chosen to make the
maximum of $\sigma_g^2$ equal to 0.005 in each case.

Figure 9.16 Narrow-sense heritability due to a single locus with two alleles (A
and A') as a function of p, the allele frequency of A. In general, for one locus,
$h^2 = 2pq[a + (q - p)d]^2/\sigma_p^2$, where $\sigma_p^2$ is the total phenotypic
variance. The curves correspond to a = 0.1 and d = 0.1 (A dominant), d = 0
(no dominance), and d = -0.1 (A' dominant).

ANSWER Let the genotypes AA, AA', and A'A' be assigned phenotypic values
0, 0, and 1, respectively, so that the A' allele is recessive.
In this case, $\mu^* = 1/2$, $a = -1/2$, and $d = -1/2$. The numerator of
Equation 9.31 is the additive genetic variance, which equals $2pq^3$. The
mean phenotype equals $q^2$, and the variance in phenotypic value is
$q^2 - (q^2)^2 = q^2(1 - q^2) = q^2(1 + q)(1 - q) = pq^2(1 + q)$. The heritability
is the additive variance divided by the phenotypic variance, namely
$2pq^3/[pq^2(1 + q)] = 2q/(1 + q)$. When the autosomal recessive trait is
rare, $q \approx 0$, and the heritability is approximately equal to the frequency
of heterozygous carriers.
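
The result for a simple Mendelian recessive can be checked without any algebra. The sketch below assumes phenotypic values 0, 0, and 1 for AA, AA', and A'A' and Hardy-Weinberg proportions; the function name is illustrative. The phenotypic variance is computed directly from the two phenotypic classes rather than from the closed form, so the comparison with $2q/(1 + q)$ is an independent check.

```python
def recessive_heritability(q):
    """Narrow-sense heritability of a fully penetrant recessive trait."""
    p = 1.0 - q
    a, d = -0.5, -0.5                             # phenotypes 0, 0, 1
    var_a = 2 * p * q * (a + (q - p) * d) ** 2    # additive variance
    mean = q * q                                  # only A'A' scores 1
    var_p = q * q * (1 - mean) ** 2 + (1 - q * q) * (0 - mean) ** 2
    return var_a / var_p


for q in (0.01, 0.1, 0.5, 0.9):
    print(q, round(recessive_heritability(q), 6), round(2 * q / (1 + q), 6))
```

For small q the heritability is close to 2q, which is approximately the carrier frequency 2pq, as stated in the answer.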


Components of genetic variation are important because they may be used to 
express the phenotypic covariance between relatives. Since the distribution 
of offspring from a given parental genotype depends on the distribution of 


potential mates, the variance components and estimates of heritability
depend not only on allele frequencies but also on the distribution of
genotype frequencies. To simplify things, we assume that the trait is determined
by one gene with two alleles, and that the population is in Hardy-Weinberg
proportions. However, the same results are also true for many genes when
the trait is determined by summing the individual allelic effects, provided
that the population is in multilocus linkage equilibrium.

Table 9.6 displays three genotypes of parents, their genotypic value, and 
the mean genotypic value of the offspring with random mating. The covari- 
ance of the offspring and one parent is calculated by summing the product of 
the last three columns of Table 9.6 and subtracting the product of the means 
of the last two columns. After tedious algebra, the covariance of offspring and 
parent is: 

$$\mathrm{Cov}_{OP} = pq[a + (q - p)d]^2 = \tfrac{1}{2}\sigma_a^2 \tag{9.33}$$


This is a remarkably simple result because it says that the covariance 
in phenotype of parents and offspring is one-half the additive genetic vari- 
ance. No component of dominance inflates the covariance in this case. 
Since environmental effects are assumed to be random with respect to
genotypic values (there is no genotype-environment correlation), environmental
effects also play no role in parent-offspring covariance. We must,
however, assume that the environments of parents and offspring are
uncorrelated for Equation 9.33 to be valid. In order to see the relation between
narrow-sense heritability and regression, recall that the regression
coefficient is defined as

$$b = \sigma_{xy}/\sigma_x^2 \tag{9.34}$$


In this case, the regression of offspring on one parent is, from Equation 9.33:

$$b_{OP} = \mathrm{Cov}_{OP}/\sigma_p^2 = \tfrac{1}{2}\sigma_a^2/\sigma_p^2 \tag{9.35}$$


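TABLE 9.6

Parent's genotype   Frequency   Genotypic value a            Offspring mean genotypic value

AA                  $p^2$       $2q(a - pd)$                 $aq + dq(q - p)$
AA'                 $2pq$       $a(q - p) + d(1 - 2pq)$      $\tfrac{1}{2}(q - p)[a + (q - p)d]$
A'A'                $q^2$       $-2p(a + qd)$                $-p[a + (q - p)d]$

a Genotypic values are expressed as deviations from the population mean.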


In the regression of offspring on the midparent (that is, the average value of
the parents), the denominator in Equation 9.35 becomes $\tfrac{1}{2}\sigma_p^2$
because this is the variance of the mean of the two parents, assuming random
mating. Hence the regression coefficient equals
$(\tfrac{1}{2}\sigma_a^2)/(\tfrac{1}{2}\sigma_p^2)$, and so the regression
coefficient of offspring on midparent equals the narrow-sense heritability.
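
The "tedious algebra" behind the parent-offspring covariance can be replaced by direct enumeration. The sketch below assumes one gene, random mating, and Hardy-Weinberg proportions; the function names and the example value of the environmental variance are illustrative. Offspring means are obtained from the transmission probabilities for each parental genotype rather than from closed-form expressions, and the covariance is then compared with one-half of the additive variance.

```python
def cov_offspring_parent(p, a, d):
    """Covariance of parental genotypic value and mean offspring value."""
    q = 1.0 - p
    shift = (p - q) * a + 2 * p * q * d
    g = {"AA": a - shift, "AA'": d - shift, "A'A'": -a - shift}
    freq = {"AA": p * p, "AA'": 2 * p * q, "A'A'": q * q}
    # Offspring genotype probabilities given one parent and a random mate.
    offspring = {
        "AA": {"AA": p, "AA'": q},
        "AA'": {"AA": p / 2, "AA'": 0.5, "A'A'": q / 2},
        "A'A'": {"AA'": p, "A'A'": q},
    }
    cov = 0.0
    for parent, f in freq.items():
        mean_off = sum(pr * g[child] for child, pr in offspring[parent].items())
        cov += f * g[parent] * mean_off  # deviations have mean zero
    return cov


def regression_slopes(p, a, d, var_e):
    """Return (slope on one parent, slope on midparent, narrow-sense h^2)."""
    q = 1.0 - p
    var_a = 2 * p * q * (a + (q - p) * d) ** 2
    var_p = var_a + (2 * p * q * d) ** 2 + var_e
    cov = cov_offspring_parent(p, a, d)
    return cov / var_p, cov / (0.5 * var_p), var_a / var_p


print(regression_slopes(p=0.3, a=0.1, d=0.05, var_e=0.01))
```

The slope on one parent comes out as half the heritability, and the slope on the midparent equals the heritability itself, whatever values of p, a, and d are supplied.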

The same reasoning can be followed to obtain covariances between other 
pairs of relatives, as summarized in Table 9.7. As can be seen, the additive 
genetic variance can be estimated directly either from parent-offspring 
covariance or from half-sib covariance. However, full-sib covariance includes 
a term resulting from dominance. The expressions in Table 9.7 are correct as 
long as there are no complications such as genotype-environment associa- 
tions or other nonrandom environmental effects such as full sibs sharing 
environmental factors common to the whole family but not shared by other 
families. Since the total variance in phenotypic value $\sigma_p^2$ can be estimated
directly, once $\sigma_a^2$ is estimated from the covariance between relatives, the
narrow-sense heritability can be estimated from Equation 9.31. The first three
relationships in Table 9.7 are the most useful in quantitative genetics and are 
commonly used in animal and plant breeding. The other relationships are 
used mainly in human quantitative genetics. 

The genetic covariance between various relatives can also be derived 
using the concepts of gene identity developed by Cotterman (1940) and 
extended by Crow and Kimura (1970). In these terms, the generalized covari- 
ance for a pair of related individuals is 

$$\mathrm{Cov}(x, y) = r\sigma_a^2 + u\sigma_d^2 \tag{9.36}$$

where the coefficients r and u are determined from coefficients of coancestry.
The coefficient of coancestry, $F_{xy}$, of two individuals x and y is the inbreeding

TABLE 9.7

Degree of relationship                          Covariance a

Offspring and one parent                        $\sigma_a^2/2$
Offspring and average of parents (midparent)    $\sigma_a^2/2$
Half siblings                                   $\sigma_a^2/4$
Full siblings                                   $(\sigma_a^2/2) + (\sigma_d^2/4)$
Monozygotic twins                               $\sigma_a^2 + \sigma_d^2$
Nephew and uncle                                $\sigma_a^2/4$
First cousins b                                 $\sigma_a^2/8$
Double first cousins                            $(\sigma_a^2/4) + (\sigma_d^2/16)$

a Variance terms due to interaction between loci (epistasis) have been ignored.

b First cousins are the offspring of matings between siblings and unrelated
individuals; double first cousins are the offspring of matings between siblings
from two different families.


coefficient of a hypothetical offspring of x and y. If individuals A and B are
the parents of x, and C and D are the parents of y, then the r and u coefficients
in Equation 9.36 are:

$$r = 2F_{xy} \qquad u = F_{AC}F_{BD} + F_{AD}F_{BC}$$


It is an instructive exercise to write out pedigrees for some of the relations in 
Table 9.7, calculate the coefficients of coancestry, and verify that Equation 
9.36 gives the correct result. 
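
As a starting point for that exercise, the coefficients can be computed from the four parental coancestries. The sketch below rests on two standard relations that are stated here as assumptions: the coancestry of x and y is the average of the four coancestries between their parents, $F_{xy} = \tfrac{1}{4}(F_{AC} + F_{AD} + F_{BC} + F_{BD})$, and $u = F_{AC}F_{BD} + F_{AD}F_{BC}$. All parents are taken to be non-inbred, so an individual's coancestry with itself is 1/2; the function and constant names are illustrative.

```python
from fractions import Fraction


def covariance_coefficients(f_ac, f_ad, f_bc, f_bd):
    """Return (r, u) for relatives x (parents A, B) and y (parents C, D)."""
    f_xy = Fraction(1, 4) * (f_ac + f_ad + f_bc + f_bd)
    r = 2 * f_xy                      # weight on the additive variance
    u = f_ac * f_bd + f_ad * f_bc     # weight on the dominance variance
    return r, u


SELF = Fraction(1, 2)       # non-inbred individual with itself
FULL_SIB = Fraction(1, 4)   # full sibs with unrelated, non-inbred parents
NONE = Fraction(0)          # unrelated individuals

# Full sibs share both parents: C is A and D is B.
print("full sibs", covariance_coefficients(SELF, NONE, NONE, SELF))

# Half sibs share one parent: C is A; B and D are unrelated.
print("half sibs", covariance_coefficients(SELF, NONE, NONE, NONE))

# Double first cousins: A and C are full sibs, and B and D are full sibs.
print("double first cousins",
      covariance_coefficients(FULL_SIB, NONE, NONE, FULL_SIB))
```

The three pairs of coefficients are (1/2, 1/4), (1/4, 0), and (1/4, 1/16), matching the standard covariances for full sibs, half sibs, and double first cousins.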

Figure 9.17 presents the narrow-sense heritabilities of diverse quantitative 
traits in farm animals and one important crop plant as estimated from the cor- 
relation between relatives. The data are presented merely to show the values 
of heritability with which breeders typically must deal. It is important to keep 
in mind that the heritabilities in Figure 9.17 pertain to one population in one 
type of environment at one particular time. The same trait in a different pop- 
ulation or in a different environment might well have a different heritability. 
Generally speaking, traits that are closely related to fitness (such as calving 
interval in cattle or eggs per hen in poultry) tend to have rather low heritabil- 
ities. Ignoring complications such as antagonistic pleiotropy (discussed later), 
long-term natural selection is expected to gradually reduce the additive genet- 
ic variance until the effect is balanced against the input of new mutations. 

For purposes of comparison, Figure 9.18 shows estimated broad-sense 
heritabilities of a number of quantitative traits in humans. Broad-sense
heritabilities vary widely for different traits, as they do in other species. Note the
low heritability of fertility, a trait that is obviously closely related to fitness. 
At the other end of the scale is total fingerprint ridge count, which is appar- 
ently not a major component of fitness considering its relatively high broad- 
sense heritability. 

Although it is tempting to think about resemblance between relatives in 
terms of classical genetic analysis such as Mendel did, there are major differ- 
ences between the approaches. When the data are measurements of a contin- 
uous character in family-structured samples from a population, estimates of 
statistical components of variance can be obtained. However, these compo- 
nents depend on allele frequencies and environmental conditions, and thus 
quantities such as heritability are far removed from the basic level of gene 
action. Direct experimental assessment of variance components (e.g., 
Mitchell-Olds 1986) shows that different populations do have different heri- 
tabilities of many traits. Moreover, a trait with high heritability does not 
mean that the trait cannot be affected by the environment. For example, 
phenylketonuria, which is caused by homozygosity for a single defective 
allele for the enzyme phenylalanine hydroxylase, is a simple Mendelian 
recessive, yet the phenotype of severe mental retardation can be completely 
circumvented by a diet low in phenylalanine. 




o 0.50 









■ Milk % 




Poultry Swine 




Clean fleece 

weight ^ Albumen 

2=- -as* 


Eggs per 
hen housed 









. Plant 


. Ear 

• Yield 


Figure 9. 1 7 Narrow-sense heritabilities for representative traits in plants and 
animals. Traits closely related to fitness (calving interval, eggs per hen, litter 
size of swine, yield and ear number of corn) tend to have rather low heritabili- 
ties. (Animal data from Pirchner 1969, who gives the range of heritabilities in 
various studies. The midpoint of the range is plotted here. Corn data from 
Robinson etal. 1949.) 

PROBLEM 9.6 Consider the following hypothetical experiment on 
plant growth. Seeds were removed from six plants and offspring were 
grown in two different light conditions. Height of the plant at eight 
weeks was measured in the parental plants and all the progeny, 
giving the following data: 

Offspring with full sunlight        Offspring with 10% full sunlight

Calculate the heritability in both environments. 

ANSWER The variance in the midparents can be calculated as
$\{\sum x_i^2 - [(\sum x_i)^2/6]\}/5 = 0.0391$. The covariance between midparents and
offspring at full sunlight is $\{\sum x_i y_i - [(\sum x_i \sum y_i)/6]\}/5 = 0.0365$,
so the narrow-sense heritability is 0.0365/0.0391 = 0.93. At 10% full sunlight,
the midparent-offspring covariance is 0.0353, and the heritability is
0.0353/0.0391 = 0.90. This example illustrates the important principle
that a trait may have a very high heritability, yet the phenotypic mean
is still strongly altered by a change in the environment. Actually, the
heritability itself may also change as one moves from one environment
to another.
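
The calculation in the answer is an ordinary regression of offspring on midparent. The sketch below uses the same computational formulas with n - 1 in the denominator; the heights are hypothetical values invented for illustration and are not the data of the problem, and the function names are likewise illustrative.

```python
def sample_cov(x, y):
    """Sample covariance with n - 1 in the denominator."""
    n = len(x)
    return (sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n) / (n - 1)


def midparent_heritability(midparents, offspring):
    """Slope of offspring on midparent, an estimate of narrow-sense h^2."""
    return sample_cov(midparents, offspring) / sample_cov(midparents, midparents)


# Hypothetical heights (arbitrary units) for six families.
midparent = [1.00, 1.20, 1.40, 1.10, 1.50, 1.30]
full_sun = [1.02, 1.17, 1.38, 1.08, 1.46, 1.31]
low_light = [0.52, 0.69, 0.86, 0.61, 0.95, 0.80]

print(round(midparent_heritability(midparent, full_sun), 2))
print(round(midparent_heritability(midparent, low_light), 2))
print(round(sum(full_sun) / 6 - sum(low_light) / 6, 2))
```

In these made-up data the slope is high in both environments even though the offspring mean differs substantially between them, which is the pattern the problem is designed to illustrate.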

Figure 9.18 Broad-sense heritabilities and ranges of heritabilities of various
traits in humans. Uncertainties about the correlation between environments of
relatives make such estimates in humans very tentative. (Data from Smith 1975.)

The distinction between estimation of variance components and knowledge
of genetic causes of human differences has particularly important
implications for social applicability of human quantitative genetics. Some of the
problems are forcefully conveyed by Lewontin (1974) and Feldman and 
Lewontin (1975). In particular, estimates of heritability within a population, 
even if they are sound, tell us nothing about the degree to which genetic dif- 
ferences account for the differences in phenotypes between populations (see 
Problem 9.6). Experimentalists working with organisms that can be manipu- 
lated in the field or laboratory can be more rigorous in the assessment of 
genetic parameters by examining the traits in several environments, and by 
doing studies that combine classical and quantitative genetic analysis. 

Twin Studies and Inferences of Heritability in Humans

Because identical twins are genetically identical, phenotypic differences 
between identical twins would seem to be a straightforward measure of how 
much phenotypic variance is caused by environment. Twin studies raise 
their own unique problems, however, and the results must be interpreted 
with caution. Before discussing the use of twins in quantitative genetics, we 
should back up a few steps and first discuss the phenomenon of twinning.

Twins are relatively frequent among human births, though the rate of 
twinning varies from population to population. Among Caucasians in the 
United States, for example, about one in 88 births results in twins; among 
Japanese in Japan, the rate is about one in 145 births (Bulmer 1970). Two 
kinds of twins actually occur. Identical twins, often called monozygotic or 
one-egg twins, arise from a single zygote that very early in embryonic devel- 
opment splits into two distinct clumps of cells, each clump thereafter under- 
going its own embryonic development. Because they arise from a single 
zygote, identical twins are necessarily genetically identical. The other kind of 
twins are called fraternal twins, dizygotic twins, or two-egg twins. Fraternal
twins arise from a double ovulation in the mother, each egg being fertilized 
by a different sperm. Because of their mode of origin, fraternal twins are 
related genetically as siblings. Most of the variation in twinning rates in 
humans is due to variation in the rate of dizygotic twinning. For example, the 
rates of monozygotic twinning among Caucasians in the United States and 
among Japanese in Japan are one in 256 and one in 238, respectively, whereas 
the respective rates of dizygotic twinning are one in 135 and one in 370 (Bul- 
mer 1970). 

For studies in quantitative genetics, identical twins are often compared 
with same-sex fraternal twins in order to discount the effects of common 
intrauterine environments. Such an approach is only partially successful, as 
identical twins often share embryonic membranes in utero (the amnion and 
chorion) that are not usually shared by fraternal twins. Moreover, because 
identical twins often have astonishingly similar facial features, they may be 


treated more similarly by parents, teachers, and peers than are fraternal 
twins. Some of these problems can be overcome by studying twins that are 
raised apart (in different households), but data of this sort are usually limited 
(Shields 1962), making estimates of heritability highly imprecise. Even when 
twins are reared apart, the environments into which they are adopted are 
generally similar. Such correlated environments inflate the apparent degree
to which traits are genetically determined. In any case, if $r_{MZ}$ and $r_{DZ}$
represent the correlation coefficients of a quantitative trait among
monozygotic and dizygotic twins, then $2(r_{MZ} - r_{DZ})$ provides a
rough estimate of the broad-sense heritability of the trait. To see where this
formula comes from, first look at Table 9.7. The covariance of monozygotic
twins in the absence of environmental correlation is $\sigma_a^2 + \sigma_d^2$, so the
correlation between monozygotic twins is this covariance divided by the
phenotypic variance, or the broad-sense heritability. If monozygotic and
dizygotic twins have the same degree of environmental correlation, then
subtracting the correlation of one from the other should remove the
environmental correlation. The covariance between dizygotic twins is the same
as that of full sibs, or $\tfrac{1}{2}\sigma_a^2 + \tfrac{1}{4}\sigma_d^2$. Assuming that
the phenotypic variance is the same in both types of twins, the expression
$2(r_{MZ} - r_{DZ})$ is equal to $[\sigma_a^2 + \tfrac{3}{2}\sigma_d^2]/\sigma_p^2$,
which is not exactly equal to the broad-sense heritability, but it is an approx- 
imation (Smith 1975). Even when the mathematically precise estimators are 
used, the problem of shared environments does not go away. 
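
The size of the discrepancy between the twin estimate and the broad-sense heritability can be explored with a few lines of code. The sketch below assumes no environmental correlation between twins and takes the twin covariances to be $\sigma_a^2 + \sigma_d^2$ for monozygotic twins and $\tfrac{1}{2}\sigma_a^2 + \tfrac{1}{4}\sigma_d^2$ for dizygotic twins; the variance components passed in are hypothetical, and the function names are illustrative.

```python
def twin_estimate(r_mz, r_dz):
    """Rough heritability estimate from twin correlations."""
    return 2 * (r_mz - r_dz)


def expected_twin_estimate(var_a, var_d, var_e):
    """Return (expected twin estimate, true broad-sense heritability)."""
    var_p = var_a + var_d + var_e
    r_mz = (var_a + var_d) / var_p
    r_dz = (0.5 * var_a + 0.25 * var_d) / var_p
    return twin_estimate(r_mz, r_dz), (var_a + var_d) / var_p


# Hypothetical variance components.
print(expected_twin_estimate(var_a=0.4, var_d=0.0, var_e=0.6))
print(expected_twin_estimate(var_a=0.4, var_d=0.2, var_e=0.4))
```

With no dominance variance the estimate coincides with the broad-sense heritability; with dominance variance it exceeds it by $\tfrac{1}{2}\sigma_d^2/\sigma_p^2$, quite apart from any bias introduced by shared environments.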

One human trait that has received an inordinate amount of attention is 
intelligence. The estimation of heritability of intelligence from twin studies is 
steeped in controversy. Aside from the necessity to define intelligence as per- 
formance on an IQ test, these studies face enormous hurdles in obtaining 
accurate assessments of causes of patterns of similarity. Ever since the data of 
Cyril Burt showing a heritability of IQ of 0.771 were cast in doubt, there have 
been efforts to revive claims of a very high heritability of IQ. Often the lan- 
guage is imprecise, with claims like "about 70% of the variance in IQ was 
found to be associated with genetic variation" (Bouchard et al. 1990). Even 
with the best care taken to study only adopted twins reared apart, the prob- 
lem of correlated environments makes it impossible to obtain an entirely reli- 
able estimate. An important point that is frequently overlooked in this 
discussion is that a high heritability implies nothing about the ability to 
change a trait by modification of the environment. The question has been 
raised whether there is any societal good to be gained from knowledge of 
heritability of IQ; the problem makes for lively reading on both sides (Lewon- 
tin et al. 1984; Herrnstein and Murray 1996). Fortunately for evolutionary 
biologists, organisms that can be reared in controlled conditions do afford the 
opportunity to obtain meaningful estimates of heritability and other compo- 
nents of quantitative genetic variation, as described in the next section. 



Population biologists typically estimate the heritability and genetic correla- 
tions of a trait for the purpose of addressing the genetic constraints on the 
evolution of the trait. The best experimental approach depends on whether 
the organisms are sampled directly from a natural population, whether a 
series of inbred lines is available, and whether laboratory rearing is practical. 
Because the components of variance (including heritability) are descriptions 
of a particular population in a particular environment, the ideal would 
appear to be to use naturally occurring individuals, keeping track of their 
familial relationships, and to fit the statistical models relating degrees of 
relationship to expected covariances. In practice, because the natural envi- 
ronment is so variable, it is often preferable to do analyses in a controlled lab- 
oratory environment, but this introduces other problems outlined below. 

Estimation of genetic variance components in natural populations generally
requires a number of restrictive assumptions: (1) diallelic inheritance, (2)
no correlation of parental and offspring environments, (3) no linkage or link- 
age disequilibrium, (4) parents equally inbred, (5) offspring not inbred, (6) 
samples of relatives drawn at random from a noninbred population, (7) no 
mutation, migration, selection, and (8) random mating (see Mitchell-Olds 
and Rutledge 1986 and references therein). In practice, small violations of 
these assumptions are acceptable, but it is nevertheless a serious challenge to 
obtain reliable estimates of genetic variance components from samples from 
natural populations. Most commonly, a full analysis is not carried out in nat- 
ural populations, but offspring collected from known mothers are studied. 
One can also estimate a lower bound on the heritability in a natural popula- 
tion by regression of measurements of laboratory-reared offspring on the 
measurements of parents sampled from nature (Riska et al. 1989). 

Once measurements are obtained on individuals with known degrees of 
relationship, the partitioning of variance into additive, dominance, and envi- 
ronmental components can be done using the standard statistical method of 
analysis of variance. Many experimental designs permit the effects of various 
factors on quantitative genetic components to be estimated. For example, 
assays of variance components could be repeated under several different 
environmental regimes, and the environmental component in the analysis of 
variance could be estimated. A second generation of organisms could also be 
studied and the variance components estimated from parent-offspring 
covariance. However, the two most common designs are analysis of variance 
of full-sib families or half-sib families. Using full-sib families alone has the 
problem that the covariance of full sibs includes both additive and domi- 
nance variance. Therefore, when the data are limited to full-sib families, all 
that can be estimated is the broad-sense heritability. On the other hand, if one 


studies only half-sibs (whose covariance is $\tfrac{1}{4}\sigma_a^2$), one can estimate
the narrow-sense heritability, but the dominance component may be quite large
and remain undetected.
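
For the half-sib design, the analysis of variance reduces to a short calculation when every family has the same number of offspring. The following sketch assumes a balanced one-way layout of paternal half-sib families and the usual estimator in which the among-family variance component estimates the half-sib covariance $\tfrac{1}{4}\sigma_a^2$, so that heritability is four times the intraclass correlation. The function name is illustrative, and the data are hypothetical.

```python
def half_sib_heritability(families):
    """Estimate narrow-sense h^2 from balanced half-sib families.

    families: list of equal-length lists of offspring measurements,
    one list per sire.
    """
    s = len(families)
    n = len(families[0])
    if s < 2 or n < 2 or any(len(f) != n for f in families):
        raise ValueError("need at least two families of equal size n >= 2")
    grand_mean = sum(sum(f) for f in families) / (s * n)
    means = [sum(f) / n for f in families]
    ms_between = n * sum((m - grand_mean) ** 2 for m in means) / (s - 1)
    ms_within = sum((x - m) ** 2
                    for f, m in zip(families, means) for x in f) / (s * (n - 1))
    var_sire = (ms_between - ms_within) / n   # estimates (1/4) sigma_a^2
    var_total = var_sire + ms_within          # estimates sigma_p^2
    return 4 * var_sire / var_total


# Hypothetical measurements for four sires with five offspring each.
data = [
    [10.1, 9.8, 10.4, 10.0, 9.9],
    [10.6, 10.2, 10.9, 10.3, 10.5],
    [9.7, 10.0, 9.5, 9.9, 10.2],
    [10.3, 10.1, 10.6, 9.8, 10.4],
]
print(round(half_sib_heritability(data), 2))
```

Because the estimate multiplies a small variance component by four, it is imprecise in small samples and can fall outside the range 0 to 1; unequal family sizes require adjustments that this sketch does not attempt.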

More elaborate designs that feature parents, full sibs and half sibs can 
give extensive partitioning of variance components. The reliability of the 
methods can be tested by comparing heritability estimates from either par- 
ent-offspring regression or from the covariance of half sibs. One means of 
using all the data in a single estimate is the method of maximum likelihood, 
a procedure for parameter estimation which solves for parameter values that 
have the maximum likelihood of obtaining the observed data under a given 
model. Maximum likelihood methods are preferable to analysis of variance 
when several traits are being examined to estimate the components of genet- 
ic covariance. Analysis of variance becomes extremely cumbersome when 
sample sizes are not the same at all levels of relationship, and there is no 
single ideal way to adjust for unequal sample sizes. 

The principle behind maximum likelihood is to construct a likelihood 
function that describes the likelihood of obtaining the observed data given 
the family structure and a set of unknown parameters to be estimated. The 
unknown parameters are the magnitudes of the various genetic and environ- 
mental variance components. The method then finds the values of the 
unknown parameters that maximize the likelihood. In practice, the unknown 
parameters form a variance-covariance matrix, and the computer algorithms 
entail extensive matrix manipulations. The computer output consists of esti- 
mates of heritability that utilize all the data, including parent-offspring, full- 
sib, half-sib, and any other relationships that are informative. Although no 
assumption of multivariate normality of the character data is necessary in 
obtaining the estimates, the estimates have meaning (and the prediction 
equation is reliable) only if the phenotypes are normally distributed. Testing 
the statistical significance of the estimates requires normality. Shaw (1987) 
provides a thorough review of the merits of maximum likelihood methods in 
quantitative genetics. 
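
To make the idea of a likelihood function concrete, here is a minimal sketch for the simplest possible case: half-sib families described by a one-way random-effects model with a between-family variance and a within-family variance. Everything in it is an assumption made for illustration (the model, the crude grid search, the function names, and the data); real analyses use the matrix-based algorithms described above and handle arbitrary pedigrees.

```python
import math


def log_likelihood(families, mu, var_s, var_e):
    """Normal log-likelihood of sib-family data under a one-way random model.

    Assumed model: y_ij = mu + s_i + e_ij with Var(s_i) = var_s and
    Var(e_ij) = var_e. Families may be of unequal size.
    """
    total = 0.0
    for fam in families:
        n = len(fam)
        mean = sum(fam) / n
        ss_within = sum((y - mean) ** 2 for y in fam)
        lam = var_e + n * var_s  # eigenvalue of the family covariance matrix along the mean
        total -= 0.5 * (
            n * math.log(2.0 * math.pi)
            + (n - 1) * math.log(var_e)
            + math.log(lam)
            + ss_within / var_e
            + n * (mean - mu) ** 2 / lam
        )
    return total


def ml_estimates(families, step=0.05, upper=10.0):
    """Crude grid search for the (var_s, var_e) pair with the highest likelihood."""
    values = [y for fam in families for y in fam]
    mu = sum(values) / len(values)  # grand mean; the ML estimate of mu for equal family sizes
    n_steps = int(round(upper / step))
    best = None
    for i in range(0, n_steps + 1):        # var_s may be zero
        var_s = i * step
        for j in range(1, n_steps + 1):    # var_e must be positive
            var_e = j * step
            ll = log_likelihood(families, mu, var_s, var_e)
            if best is None or ll > best[0]:
                best = (ll, var_s, var_e)
    return best[1], best[2]


# Hypothetical paternal half-sib families (four sires, four offspring each).
half_sib_families = [
    [7.5, 8.5, 10.5, 11.5],
    [9.0, 10.0, 12.0, 13.0],
    [10.0, 9.0, 13.0, 12.0],
    [10.5, 11.5, 13.5, 14.5],
]

var_s_hat, var_e_hat = ml_estimates(half_sib_families)
# For half sibs the between-family component estimates one-quarter of the additive variance.
h2_hat = 4.0 * var_s_hat / (var_s_hat + var_e_hat)
```

Because the likelihood is written family by family, nothing in the function requires equal family sizes, which is the practical advantage over analysis of variance noted above. The grid search stands in for a proper numerical optimizer only to keep the sketch short.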

Heritability tells us virtually nothing about the actual mode of inheritance 
of a quantitative trait, useful as the concept may be in predicting response to 
selection. The heritability of a trait represents the cumulative effect of all 
genes that affect the trait. Even if a trait is determined by a single gene, heri- 
tability depends in a complex manner on the values of p, a, and d, and these 
individual components cannot be disentangled. (The values of p, a, and d are 
said to be statistically confounded.) With more than one gene, the heritabili- 
ty includes a summation of terms for each gene, and each term has its own 
particular values of p, q, a, and d. Here, precisely, is the problem: for a quantitative
trait determined by, say, 10 diallelic loci, there would be 30 quantities
involved in heritability: 10 allele frequencies, 10 values of a, and 10 values of
d. Heritability is but a single number that gives the combined effect of all 30
quantities. It says nothing about any one of them.
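
The confounding can be seen in a small calculation. The sketch below assumes the usual one-locus parameterization (genotypic values a, d, and -a for the three genotypes, with p the frequency of the allele that increases the trait) and the standard results $\sigma_a^2 = 2pq[a + (q - p)d]^2$ and $\sigma_d^2 = (2pqd)^2$. The function names and the parameter values are hypothetical.

```python
def single_locus_components(p, a, d):
    """Additive and dominance variance for one diallelic locus under random mating."""
    q = 1.0 - p
    var_a = 2.0 * p * q * (a + (q - p) * d) ** 2
    var_d = (2.0 * p * q * d) ** 2
    return var_a, var_d


def single_locus_h2(p, a, d, var_e):
    """Narrow-sense heritability when one locus plus environment determine the trait."""
    var_a, var_d = single_locus_components(p, a, d)
    return var_a / (var_a + var_d + var_e)


# Three quite different hypothetical situations...
h2_first = single_locus_h2(p=0.5, a=1.0, d=0.0, var_e=0.5)
h2_second = single_locus_h2(p=0.2, a=1.25, d=0.0, var_e=0.5)
h2_third = single_locus_h2(p=0.5, a=1.0, d=1.0, var_e=0.25)
# ...that all yield the same heritability.
```

An equal allele frequency with no dominance, a rare allele of larger effect, and a completely dominant allele in a less variable environment are indistinguishable from the heritability alone.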

It must be emphasized that heritability is a quantity that comes from a
mathematical model of reality, and the model has many assumptions. We 
have assumed that all genes affecting the trait act independently of one 
another and are unlinked. In actual cases, genes often interact and can be 
linked. (The model can, however, be extended to incorporate at least partial- 
ly the effects of linkage and epistasis.) Moreover, the assumption of no corre- 
lation of parental and offspring environments is not always easy to test. All in 
all, while heritability, especially realized heritability, is an indispensable aid 
to plant and animal breeders, it lends itself to no easy interpretation in sim- 
ple genetic terms (apart from the statistical description through parent- 
offspring regression). Another difficulty in interpreting heritability values is 
that they depend on the range of environments that occur. The denominator
$\sigma_p^2$ in Equation 9.23 is the total variance in phenotypic value in the population.
Because the total variance includes the variance resulting from environmental
differences among individuals, increasing the variation in the
environment decreases $h^2$. Exceptionally thoughtful discussions of the concept
of heritability and its strengths and limitations are found in Kempthorne
(1978) and Jacquard (1983).
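
The dependence on the range of environments is easy to illustrate. The short sketch below holds the genetic components fixed and varies only the environmental variance; it assumes the phenotypic variance is the simple sum of additive, dominance, and environmental components, and all numbers are hypothetical.

```python
def narrow_sense_h2(var_a, var_d, var_e):
    """Additive variance divided by total phenotypic variance.

    Assumes no genotype-environment covariance or interaction, so the
    total variance is the plain sum of the three components.
    """
    return var_a / (var_a + var_d + var_e)


# Same genes, increasingly heterogeneous environments (hypothetical values).
environmental_variances = (0.5, 1.5, 2.5, 7.5)
h2_values = [narrow_sense_h2(2.0, 0.5, ve) for ve in environmental_variances]
```

The additive variance never changes in this example, yet the heritability falls steadily as the environmental variance grows.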

Heritability values are determined in part by gene frequencies. Because 
gene frequencies change during the course of selection, the heritability is also 
expected to change. In practice, however, the heritability changes sufficiently 
slowly that over the course of a few generations, it can be regarded as approx- 
imately constant. The approximate constancy of heritability has a twofold 
cause: (1) if a particular gene accounts for only a small proportion of the total 
phenotypic variance in a quantitative trait, then the gene frequency does not 
change very rapidly, and (2) the values of a and d remain nearly constant provided
that the environment does not change drastically from one generation
to the next. Thus, at least for the first 10 generations or so, heritability usually
remains approximately constant and can be used as a constant in the prediction
equation (Equation 9.10). To be precise, suppose $h^2$ is constant and let
$\mu_t$ and $S_t$ represent the mean of the population and the selection differential
in the $t$th generation. Then, over the length of time during which $h^2$ is
approximately constant,

$$\mu_t - \mu_0 = h^2 (S_0 + S_1 + \cdots + S_{t-1})$$

The quantity $\mu_t - \mu_0$ is the total response to selection, and $S_0 + S_1 + \cdots + S_{t-1}$
is called the cumulative selection differential. During the time in which $h^2$ is
approximately constant, therefore, a plot of $\mu_t$ against cumulative selection
differential is expected to yield a straight line with slope equal to $h^2$, as illustrated
for a case in mice in Figure 9.19.
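
The straight-line relation suggests a simple way to estimate heritability from a selection experiment: regress the generation means on the cumulative selection differential. The sketch below generates means from the prediction equation with a constant heritability and then recovers that value as the least-squares slope. The function names, starting mean, and selection differentials are hypothetical.

```python
def predicted_means(mu0, h2, selection_differentials):
    """Generation means when each generation responds by h2 times its selection differential."""
    means = [mu0]
    for s in selection_differentials:
        means.append(means[-1] + h2 * s)
    return means


def realized_heritability(means, selection_differentials):
    """Least-squares slope of generation mean on cumulative selection differential."""
    cumulative = [0.0]
    for s in selection_differentials[: len(means) - 1]:
        cumulative.append(cumulative[-1] + s)
    n = len(means)
    x_bar = sum(cumulative) / n
    y_bar = sum(means) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(cumulative, means))
    sxx = sum((x - x_bar) ** 2 for x in cumulative)
    return sxy / sxx


# Hypothetical experiment: four generations of selection, constant heritability 0.3.
S = [2.0, 1.5, 2.5, 2.0]
means = predicted_means(20.0, 0.3, S)
slope = realized_heritability(means, S)
total_response = means[-1] - means[0]
```

With real data the points scatter about the line, and a bend in the plot marks the generation at which the assumption of constant heritability begins to fail.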

Figure 9.19 Linearity in response against the cumulative selection differential
for body weight in mice at age six weeks. Linearity in the up (high-weight)
direction continues for about twice as long as it does in the down (low-weight)
direction. (After Falconer 1955.) [Plot not reproduced; horizontal axis: cumulative
selection differential.]

Indirect Estimation of the Number of Genes Affecting 
a Quantitative Character 

The number of genes that contribute to quantitative traits is not always large. 
We have already seen an example of seed color in wheat in which the num- 
ber of genes was three. When the number of genes is relatively small, the 
number can often be estimated from the means and variances observed in 
different strains and their hybrids and backcrosses. In the case of two addi- 
tive alleles of each of three genes, when parental strains are homozygous for 
all unfavorable or all favorable alleles, then they differ by six units in phenotypic
value. The variance in phenotypic value in the $F_2$ generation equals 3/2
units$^2$. If there are n unlinked, additive genes, the difference D in phenotypic
value between the means of parental inbred lines is 2n units, and the variance
$\sigma_g^2$ in the $F_2$ generation is n/2 units$^2$.
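
These two values can be checked by direct enumeration. The sketch below assumes that each favorable allele adds one unit to the phenotype, that the loci are unlinked, and that the first-generation hybrid is heterozygous at every locus, so that the number of favorable alleles in a second-generation individual is binomial with 2n trials and probability 1/2. The function names are hypothetical.

```python
from math import comb


def f2_distribution(n_loci):
    """Probability of carrying k favorable alleles (k = 0..2n) in the F2 generation."""
    m = 2 * n_loci
    return [comb(m, k) / 2 ** m for k in range(m + 1)]


def mean_and_variance(probs):
    """Mean and variance of phenotypic value when each favorable allele adds one unit."""
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var


n = 3
parental_difference = 2 * n - 0          # all-favorable parent minus all-unfavorable parent
f2_mean, f2_variance = mean_and_variance(f2_distribution(n))
```

For three genes the enumeration gives a parental difference of six units and an F2 variance of 3/2, matching the values in the text; other values of n give 2n and n/2.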

In order to obtain an estimate of the number of genes that is independent 
of the units of measurement, a ratio is needed to make the units cancel. One 
possibility, first suggested by Wright (in Castle 1921), is

$$n = \frac{D^2}{8\sigma_g^2} \qquad (9.39)$$

With n unlinked, additive genes, $n = (2n)^2/[8(n/2)] = n$, as it should.
Equation 9.39 is based on the assumptions of complete additivity, equal 
effects of all genes, no linkage, and fixed differences between parental lines. 

When the assumptions are violated, application of the equation usually results 
in estimates of gene number that are smaller than the actual numbers. For this 
reason, the quantity estimated in Equation 9.39 is called the effective number 
of genes because it defines a lower limit to the actual number. Figure 9.20 pre- 
sents the results of a simulation, using a range of 2 to 10 genes to generate sam- 
ple data to which Equation 9.39 was applied to estimate n. The message is that 
this method is very approximate and somewhat biased. The statistical proper- 
ties of the method have been improved (Zeng 1992), but the estimate is still 
only a rough approximation to the number of genes affecting a trait. 
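
Why the estimate falls short when genes differ in effect can be shown without any simulation. The sketch below is an illustration under stated assumptions: loci are unlinked and purely additive, locus i adds an amount effects[i] per favorable allele, so that the parental difference is twice the sum of the effects and the F2 genetic variance is half the sum of their squares. The function names and effect sizes are hypothetical.

```python
def wright_gene_number(parental_difference, genetic_variance):
    """Wright's estimator: squared parental difference over eight times the F2 genetic variance."""
    return parental_difference ** 2 / (8.0 * genetic_variance)


def expected_estimate(effects):
    """Value of the estimator expected for unlinked additive loci with the given allelic effects."""
    parental_difference = 2.0 * sum(effects)
    genetic_variance = sum(a * a for a in effects) / 2.0
    return wright_gene_number(parental_difference, genetic_variance)


# Five loci in both cases (hypothetical effect sizes).
equal_effects = expected_estimate([1.0, 1.0, 1.0, 1.0, 1.0])
unequal_effects = expected_estimate([3.0, 1.0, 0.5, 0.25, 0.25])
```

Under these assumptions the estimator reduces to the square of the summed effects divided by the sum of squared effects, which equals the true number of loci only when all effects are equal and is smaller otherwise. One gene of large effect is enough to pull the effective number well below five.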

The variance in Equation 9.39 is the genetic variance, which is the variance
in phenotypic value resulting from genetic differences among individuals.
When the environment contributes an amount $\sigma_e^2$ to the phenotypic variance,
then the variance within the parental inbred lines, or within the $F_1$ population,
equals $\sigma_e^2$. This is because the populations are genetically uniform, and the
only source of variation in phenotype is the environment. However,
within the $F_2$ generation, variation in phenotype results in part from genetic
variation and in part from environmental variation, and the total variance in
phenotypic value equals the summation $\sigma_g^2 + \sigma_e^2$. Therefore, subtraction of the
$F_1$ variance from the $F_2$ variance gives an estimate of $\sigma_g^2$ because

$$\sigma_g^2 = [\sigma_g^2 + \sigma_e^2] - \sigma_e^2$$
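
In practice the subtraction is carried out on sample variances. The following sketch uses small made-up samples to show the arithmetic: the variance among genetically uniform first-generation hybrids estimates the environmental variance, and removing it from the second-generation variance leaves an estimate of the genetic variance, which can then be used in Wright's formula. All data values and names are hypothetical.

```python
from statistics import mean, variance


def genetic_variance(f2_values, f1_values):
    """Estimate of genetic variance: F2 sample variance minus F1 sample variance."""
    return variance(f2_values) - variance(f1_values)


# Hypothetical phenotypic measurements.
parent_low = [4.0, 3.5, 4.5, 4.0]
parent_high = [16.0, 15.5, 16.5, 16.0]
f1 = [10.0, 11.0, 9.0, 10.0, 10.0]       # genetically uniform: environmental variance only
f2 = [6.0, 8.0, 10.0, 12.0, 14.0]        # genetic plus environmental variance

var_g = genetic_variance(f2, f1)
D = mean(parent_high) - mean(parent_low)
n_hat = D ** 2 / (8.0 * var_g)           # effective number of genes
```

With samples this small the estimate is of course very imprecise, and a negative difference of variances can occur by chance; the example is meant only to show which variances are subtracted.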


Figure 9.20 Computer-generated samples of an $F_2$ population having a range
of numbers of loci with purely additive effects that influence a trait. For each
sample, the number of genes was estimated following Wright's method. [Plot not
reproduced; horizontal axis: number of loci in simulation.]

Further discussion of the genetic variance occurs later in this chapter.
Lande (1981) gives several alternative methods of estimating the genetic variance
using data from inbreds, hybrids, and backcrosses. Cockerham (1986)
extended the analysis to obtain an unbiased estimation of the difference in 
parental means, and he combined the data from parentals, P,,